Arm-based cloud instances outperform x86 instances by up to 64% on VP9 encoding

Yichen Jia
November 6, 2023

Video streaming is a key cloud application that attracts intense attention across the industry. The COVID-19 pandemic had a clear impact on video industry trends and challenges: according to Bitmovin's research report [1], live streaming at scale and low latency have become the top two areas for innovation, while video quality remains a top concern.

To adapt to this trend and meet new customer requirements such as lower latency, higher resolution, and higher bit depth, popular video codecs have evolved from x264 to x265, VP9, and AV1. In particular, VP9 has been widely deployed in production, accounting for 17% of live-streaming and 10% of VoD encoding [1]. Compared with x264, VP9 is widely considered to provide better compression efficiency, enabling higher video quality at lower bit rates, which is particularly beneficial for streaming services and users with limited bandwidth [2].

To better enable customers to deploy VP9 on increasingly popular Arm-based cloud servers, Arm, together with its partner VectorCamp, contributed numerous open-source optimizations from 2021 to 2023 and achieved significant performance improvements. These improvements are mainly achieved by leveraging Arm Neon technology [5], an advanced SIMD architecture extension for Arm processors.
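
To give a flavor of what these Neon optimizations look like, below is a minimal sketch of a 16x16 sum-of-absolute-differences (SAD) kernel, one of the hottest loops in any block-based encoder, written with baseline Armv8.0 Neon intrinsics. The function name and shape are illustrative assumptions, not libvpx's actual implementation:

#include <arm_neon.h>
#include <stdint.h>

/* Illustrative 16x16 SAD kernel using baseline Neon intrinsics.
 * Strides are in pixels. Not libvpx's actual code. */
static uint32_t sad16x16_neon_sketch(const uint8_t *src, int src_stride,
                                     const uint8_t *ref, int ref_stride) {
  uint16x8_t acc = vdupq_n_u16(0);
  for (int row = 0; row < 16; ++row) {
    uint8x16_t s = vld1q_u8(src);  /* 16 source pixels in one load */
    uint8x16_t r = vld1q_u8(ref);  /* 16 reference pixels */
    /* Widening absolute-difference-accumulate: acc += |s - r| */
    acc = vabal_u8(acc, vget_low_u8(s), vget_low_u8(r));
    acc = vabal_u8(acc, vget_high_u8(s), vget_high_u8(r));
    src += src_stride;
    ref += ref_stride;
  }
  return vaddlvq_u16(acc);  /* horizontal sum of all 8 lanes */
}

Sixteen pixels are processed per loop iteration where scalar code would handle one, which is the basic source of the speedups reported below.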

Specifically, VectorCamp [8] has contributed Neon implementations of the high bit-depth functions, starting with all the FDCT/FHT functions for the various block sizes (4x4 up to 32x32) and refactoring and simplifying the DCT code and its helper functions in the process. In addition, VectorCamp provided optimized versions of the temporal filter, color quantization, and diamond-search SAD, plus non-SDOT implementations of the high bit-depth variance, subtract-block, and SAD functions.
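
As an example of the simpler kernels in this list, a high bit-depth subtract block computes the prediction residual. High bit-depth pixels are stored as 16-bit values, so each 128-bit Neon vector holds 8 pixels instead of 16. A minimal sketch for an 8-pixel-wide block follows; the function name is a hypothetical stand-in for libvpx's more general vpx_highbd_subtract_block:

#include <arm_neon.h>
#include <stdint.h>

/* Illustrative high bit-depth subtract block: diff = src - pred.
 * 10/12-bit samples fit in int16_t, so a plain 16-bit subtract is safe.
 * Strides are in elements. */
static void highbd_subtract_8wide_sketch(int rows, int16_t *diff,
                                         int diff_stride,
                                         const uint16_t *src, int src_stride,
                                         const uint16_t *pred,
                                         int pred_stride) {
  for (int r = 0; r < rows; ++r) {
    int16x8_t s = vreinterpretq_s16_u16(vld1q_u16(src));
    int16x8_t p = vreinterpretq_s16_u16(vld1q_u16(pred));
    vst1q_s16(diff, vsubq_s16(s, p));  /* 8 residuals per store */
    src += src_stride;
    pred += pred_stride;
    diff += diff_stride;
  }
}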

For standard bit-depth, Arm has contributed a series of Armv8.4-A DotProd Neon optimizations to accelerate the standard bit-depth convolution, sum-of-absolute-differences (SAD), and variance functions [9]. Arm has also improved the existing Armv8.0 Neon implementations of the SAD, SAD4D, variance, and sub-pixel variance functions, plus a host of helper functions for multi-vector reduction and matrix transposition. For Armv8.6-A, Arm contributed an I8MM implementation of the convolution algorithms and added new Neon code for the mean-squared-error (MSE), block error, and intra-frame block predictor functions.
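
The DotProd extension gives SAD-style reductions a particularly large boost: after taking per-pixel absolute differences with VABD, a single UDOT instruction against a vector of ones folds 16 bytes into four 32-bit accumulator lanes. Below is a hedged sketch of that idea (compile with, for example, -march=armv8.2-a+dotprod); the exact libvpx kernels are organized differently:

#include <arm_neon.h>
#include <stdint.h>

/* Illustrative SAD for a 16-wide block of height h, using the UDOT
 * dot-product instruction (optional from Armv8.2-A, mandatory in
 * Armv8.4-A). Strides are in pixels. */
static uint32_t sad16xh_dotprod_sketch(const uint8_t *src, int src_stride,
                                       const uint8_t *ref, int ref_stride,
                                       int h) {
  const uint8x16_t ones = vdupq_n_u8(1);
  uint32x4_t acc = vdupq_n_u32(0);
  for (int row = 0; row < h; ++row) {
    uint8x16_t ad = vabdq_u8(vld1q_u8(src), vld1q_u8(ref));
    /* UDOT: each u32 lane accumulates the sum of 4 adjacent bytes */
    acc = vdotq_u32(acc, ad, ones);
    src += src_stride;
    ref += ref_stride;
  }
  return vaddvq_u32(acc);  /* horizontal sum to a scalar */
}

Compared with the widening vabal_u8 chain in the earlier sketch, this replaces two accumulate instructions per row with one and removes any risk of 16-bit accumulator overflow for tall blocks.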

For high bit-depth, Arm has optimized the SAD and SAD4D Neon paths and contributed new Neon paths for the Hadamard transform, min/max, averaging, SATD, block error, and intra-frame block predictor functions. Arm also ported its standard bit-depth sub-pixel variance optimizations to the high bit-depth paths.
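
High bit-depth kernels follow the same pattern but operate on 16-bit pixels, so each vector covers half as many pixels and intermediate sums need wider accumulators; this is one reason the 10-bit paths had the most to gain from hand-written Neon, as the results below show. A hedged sketch of a high bit-depth SAD (hypothetical function name, 8-pixel-wide block):

#include <arm_neon.h>
#include <stdint.h>

/* Illustrative high bit-depth SAD for an 8-wide block of height h.
 * Pixels are 16-bit, so absolute differences are widened pairwise
 * into a 32-bit accumulator to avoid overflow. Strides in elements. */
static uint32_t highbd_sad8xh_sketch(const uint16_t *src, int src_stride,
                                     const uint16_t *ref, int ref_stride,
                                     int h) {
  uint32x4_t acc = vdupq_n_u32(0);
  for (int row = 0; row < h; ++row) {
    uint16x8_t ad = vabdq_u16(vld1q_u16(src), vld1q_u16(ref));
    acc = vpadalq_u16(acc, ad);  /* pairwise widen-accumulate u16 -> u32 */
    src += src_stride;
    ref += ref_stride;
  }
  return vaddvq_u32(acc);
}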

A full list of contributions by Konstantinos Margaritis (VectorCamp), Jonathan Wright (Arm), Salomé Thirot (Arm), George Steed (Arm), Gerda Zsejke More (Arm), and their teams can be found on the libvpx log page [10].

To illustrate the performance uplift from these Neon optimizations, and the performance and TCO benefits of running VP9 on Arm servers, we conducted extensive experiments on AWS EC2 platforms.

System Configurations 

We measured performance on AWS C6g, C7g, and C6a 16xlarge instances, as listed in Table 1. C6g instances are based on AWS Graviton2, which uses Arm Neoverse N1 cores, while C7g instances are based on AWS Graviton3, which uses Arm Neoverse V1 cores. All instances run Ubuntu 22.04 with GCC 13.1. We compared the original VP9 branch (commit 2eb934d, 2021) [6] with the optimized branch (commit 60ee1b1, 2023) [7]. The test videos are the 8-bit and 10-bit 4K Bosphorus sequences from Ultra Video Group [3] and the 8-bit and 10-bit 1080p 4Ever sequences from the UHD HEVC DASH dataset [4].

AWS Instance  | Architecture | vCPUs | DRAM (GB) | On-Demand Pricing
C6g.16xlarge  | Arm          | 64    | 128       | $2.176/h
C7g.16xlarge  | Arm          | 64    | 128       | $2.312/h
C6a.16xlarge  | x86          | 64    | 128       | $2.448/h

Table 1: AWS VM configuration and pricing information

Performance Evaluation 

We conducted two sets of experiments: one to illustrate the performance improvement on Arm from our optimizations, and one to show the performance advantage of Arm over a competitive offering. All results presented below are full-socket performance with the same number of CPU cores (64 vCPUs) in use. In all figures, --cpu-used refers to the VP9 encoder parameter that trades compression speed against video quality: a lower value yields higher video quality, as typically used in Video on Demand (VoD) scenarios, whereas a higher value yields higher encoding speed at some cost to video quality or rate-control accuracy, as typically used in live streaming.
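
For readers reproducing this setup, the same trade-off is exposed programmatically through libvpx's encoder API as the VP8E_SET_CPUUSED control (which, despite the name, also applies to VP9). A minimal sketch follows, with error handling and the frame loop omitted; the resolution and thread count are chosen to mirror the tests above:

#include <vpx/vpx_encoder.h>
#include <vpx/vp8cx.h>  /* vpx_codec_vp9_cx() and VP8E_SET_CPUUSED */

int main(void) {
  vpx_codec_ctx_t codec;
  vpx_codec_enc_cfg_t cfg;

  /* Start from the VP9 defaults, then size for the 1080p test clips */
  vpx_codec_enc_config_default(vpx_codec_vp9_cx(), &cfg, 0);
  cfg.g_w = 1920;
  cfg.g_h = 1080;
  cfg.g_threads = 64;  /* one thread per vCPU on a 16xlarge instance */

  vpx_codec_enc_init(&codec, vpx_codec_vp9_cx(), &cfg, 0);

  /* Equivalent of vpxenc --cpu-used=8: fastest preset (live streaming);
   * lower values such as 4 favor quality, as in VoD encoding. */
  vpx_codec_control(&codec, VP8E_SET_CPUUSED, 8);

  /* ... feed frames with vpx_codec_encode(), drain packets ... */

  vpx_codec_destroy(&codec);
  return 0;
}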

Performance Improvement on Arm 

To quantify the improvement in the latest VP9 commit, we benchmarked with 8-bit 1080p and 4K videos. As shown in Figure 1, for 8-bit 1080p video the optimized version increases FPS over the original by 37.2% on C6g and 37.8% on C7g. Similarly, as Figure 2 shows, FPS increases by 21.3% on C6g and 15.3% on C7g for 8-bit 4K input.

Figure 1: Performance improvement for 8-bit 1080p video

Figure 2: Performance improvement for 8-bit 4K video

For 10-bit inputs, the performance boost is even more significant. For 10-bit 1080p video, FPS increases by 70.7% on C6g and 61.6% on C7g, as Figure 3 shows. For 10-bit 4K video, FPS increases from 5.12 to 12.39 (142.0%) on C6g, and from 8.95 to 18.61 (108.0%) on C7g.

Figure 3: Performance improvement for 10-bit videos

Overall, across all the cases above, the observed performance uplift ranges from 15.3% to 142.0%. The Neon optimizations to VP9 contributed by Arm and its partners therefore boost performance significantly.

Competitive Analysis 

To compare full-socket performance on Arm and x86, we measured FPS on C6a (AMD Milan) and C7g (Arm Neoverse).

For 8-bit 1080p video, the Arm-based C7g achieves 30.6% higher FPS than the x86-based C6a when --cpu-used is 4, and 64.0% higher FPS when --cpu-used is 8, as Figure 4 shows. For 8-bit 4K video (Figure 5), FPS on C7g is higher than on C6a by 43.1% and 50.0% at --cpu-used values of 4 and 8, respectively.

Figure 4: Performance comparison for 8-bit 1080p video

Figure 5: Performance comparison for 8-bit 4K video

For 10-bit videos, the advantage is more significant still. Figure 6 shows that the Arm-based C7g provides up to 63.2% higher FPS for 1080p video and up to 42.4% higher FPS for 4K video than the x86-based C6a.

Figure 6: Competitive analysis for 10-bit videos

Overall, across all the cases above, the performance benefits we observed on Arm-based C7g EC2 instances range from 30.6% to 64.0% compared with the x86-based C6a.

Summary 

Based on our evaluation, deploying VP9 on Arm platforms is highly promising. The continuous optimization of VP9 over the past two years delivers a significant performance uplift, up to 142.0% on Arm instances compared with the original branch without the new Neon optimizations. Compared with the x86-based C6a, the latest VP9 codec provides up to 64.0% higher FPS on the Arm-based C7g.

References 

[1] The 6th Annual Bitmovin Video Developer Report: Shaping the Future of Video 2022/2023, https://bitmovin.com/wp-content/uploads/2022/12/bitmovin-6th-video-developer-report-2022-2023.pdf
[2] VP9 Codec: The Complete Guide to Google's Open Source Video Codec, https://bitmovin.com/vp9-codec-status-quo/
[3] Ultra Video Group test sequences, https://ultravideo.fi/#testsequences
[4] Ultra High Definition HEVC DASH Data Set, https://download.tsi.telecom-paristech.fr/gpac/dataset/dash/uhd/
[5] Arm Neon technology, https://www.arm.com/technologies/neon
[6] VP9 source code, original version, https://chromium.googlesource.com/webm/libvpx/+/2eb934d9c1fb4a460e3f03c8578b7b4f4f195784
[7] VP9 source code, optimized version, https://chromium.googlesource.com/webm/libvpx/+/60ee1b149bd3de8e53857ff5a9f7f19d9398144e
[8] VectorCamp, https://www.vectorcamp.gr
[9] Use case: improving VP9 performance with Arm dot product instructions, https://developer.arm.com/documentation/102651/a/Use-case--improving-VP9-performance
[10] libvpx commit log, https://chromium.googlesource.com/webm/libvpx/+log
