Reduce H.265 High-Res Encoding Costs by over 80% with AWS Graviton2

April 26, 2022

5 minute read time.

The demand for high-resolution, high-definition video content is exploding. Growth in camera resolution, the size of devices (including smartphones, tablets and TVs), and in network bandwidth drives this demand. To save bandwidth and storage space, these video streams are often compressed using newer codecs like H.265. And while more efficient at compression, these codecs require significantly higher compute resources. This paper describes the work done by Videolan/FFlabs and AWS teams to optimize video encode processing for H.265 on Arm-based server platforms in the cloud.

Background

Over the last few years, there has been steady growth in both the generation and consumption of high-resolution content. Better device cameras and higher-resolution screens for viewing content have driven this growth. Newer codecs like H.265/HEVC, VP9 or AV1 are around 50% more efficient at compressing such higher-resolution content than legacy codecs like H.264, as Table 1 shows.

Resolution	Required bandwidth (H.264)	Required bandwidth (H.265)
1280×720 (HD)	3 Mbps	1.5 Mbps
1920×1080 (FHD)	6 Mbps	3 Mbps
3840×2160 (UHD)	25 Mbps	12 Mbps
4096×2160 (4K)	32 Mbps	15 Mbps

Table 1: Required bandwidth for high-resolution videos with H.264 and H.265
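
As a quick check on the "around 50%" figure, the savings can be computed directly from Table 1. The snippet below is a minimal sketch: the bandwidth numbers are copied from the table, while the function and variable names are illustrative and not part of any library.

```python
# Bandwidth figures in Mbps, copied from Table 1: (H.264, H.265).
TABLE_1 = {
    "1280x720 (HD)": (3.0, 1.5),
    "1920x1080 (FHD)": (6.0, 3.0),
    "3840x2160 (UHD)": (25.0, 12.0),
    "4096x2160 (4K)": (32.0, 15.0),
}


def bandwidth_saving_percent(h264_mbps, h265_mbps):
    """Return the percentage of bandwidth saved by H.265 relative to H.264."""
    return (h264_mbps - h265_mbps) / h264_mbps * 100.0


# Per-resolution saving, keyed by the same labels as the table.
savings = {
    resolution: bandwidth_saving_percent(h264, h265)
    for resolution, (h264, h265) in TABLE_1.items()
}

for resolution, percent in savings.items():
    print(f"{resolution}: {percent:.1f}% less bandwidth with H.265")
```

The savings come out at 50% for HD and FHD and slightly above 50% for UHD and 4K.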

This compression efficiency comes with a much greater need for compute cycles, which can be 10x higher compared to H.264 compression. Typical processors used in the cloud like the AWS Graviton2 or Intel Xeon family often use vector-processing capabilities like Neon or SSE/AVX instructions to accelerate video processing. Over the last year, there have been significant efforts to optimize the open-source libx265 implementation of the H.265 encoder. On Arm Neoverse-based platforms like the AWS Graviton2, which supports Neon instructions, both Videolan and AWS have contributed to this effort. The result is an impressive performance uplift from 1.4x to 3x in certain scenarios, which are described in detail in the next section.

The optimized code is available at https://bitbucket.org/multicoreware/x265_git/

Performance results

We benchmarked the latest snapshot of the open-source libx265 codec (https://bitbucket.org/multicoreware/x265_git/) on comparable Graviton2 and competing instances on AWS:

c6g.metal – 64 Arm Neoverse N1 cores (64 vCPU)
c5.metal – 48 Intel Xeon cores, 96 threads (96 vCPU)

We used the same video in various resolutions and encoding presets to see the impact on performance under different scenarios.

AWS Graviton2 performance uplift from Arm Neon optimizations

We benchmarked libx265 on C6g bare-metal before and after the Neon optimizations to measure the uplift in performance. For fast and medium presets, we found an FPS (frames per second) uplift of ~40% across different resolutions, whereas for slow presets the FPS uplift was close to 100%.

x265 performance and speedup - ultrafast preset - on C6g

x265 performance and speedup - medium preset - on C6g

x265 performance and speedup - veryslow preset - on C6g

AWS Graviton2 to Intel Xeon performance comparison

We ran multiple instances of the libx265 encoder to encode multiple jobs at the same time, spreading these jobs evenly across the cores on both C6g bare-metal and C5 bare-metal instances.

For the C5 bare-metal instance, performance scaled linearly until 48 vCPUs. After 48 vCPUs the second hardware thread on each physical core is used, and gains are no longer linear; in some cases, they flatten out.

On the other hand, C6g bare-metal instances show good scaling all the way up to the full 64 cores with no degradation in performance.

x265 socket scaling - c5 vs c6g

At the full socket level, the C6g instances performed 80% better compared to the C5 instances. And at roughly half the cost, the Arm-based instances provide an unbeatable 3x cost reduction for running H.265.
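
The cost claim combines two ratios: how much faster the instance encodes and how much it costs per hour. The sketch below works through that arithmetic. It assumes "80% better" means 1.8x the throughput and "roughly half the cost" means an hourly price ratio of 0.5; the post does not list actual prices, and the function name is illustrative.

```python
def cost_reduction(speedup, price_ratio):
    """Return the factor by which the cost per encoded frame drops.

    speedup:     throughput of the new instance / throughput of the old one
    price_ratio: hourly price of the new instance / hourly price of the old one
    """
    if speedup <= 0 or price_ratio <= 0:
        raise ValueError("speedup and price_ratio must be positive")
    # Cost per frame = hourly price / frames per hour, so the new-to-old
    # cost ratio is price_ratio / speedup. Inverting it gives the reduction.
    return speedup / price_ratio


# Assumed inputs: 80% higher throughput at half the hourly price.
socket_level = cost_reduction(speedup=1.8, price_ratio=0.5)
print(f"Socket-level scenario: {socket_level:.1f}x lower cost per frame")
```

Under these assumptions the reduction is about 3.6x, in line with the rounded 3x figure. The same function with a speedup of 4 (a quarter of the encode time) and the same price ratio gives 8x, which matches the stress-test result reported later in this post.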

Stress-test: Video-on-Demand at highest compression

A very popular use case for software encoding is Video-on-Demand, where videos are pre-compressed to the highest possible ratio without compromising video quality. We used the configuration from the "Benwaggoner HEVC encoding challenge" with the Netflix Sol Levante 8-bit SDR 1080p .y4m input file.

We observe that the Graviton2-based instances provide an exceptional performance advantage in this scenario. They compress the entire video in one quarter of the time taken by the comparable x86-based instances, which translates into 8x lower costs.

Cost and time to encode - C6g vs C5

Conclusion

With the growth in high-resolution video content, the use of higher-compression codecs like H.265 for video-streaming applications in the cloud becomes essential. There have been several efforts to optimize the libx265 codec for Arm Neoverse platforms. Encoding high-resolution videos using these latest optimizations provides up to 2x performance uplift over previous implementations. And at the system level, AWS Graviton2 bare-metal instances provide both better scaling and 80% higher overall performance at about a third of the cost compared to other similar bare-metal instances. Any business with a significant monthly video encoding bill should check out AWS Graviton2.

Benchmarking configuration:

Experiment configurations:

Instances: AWS C5 and C6g Bare Metal

OS: Ubuntu 20.04

GCC: 9.3

x265 branch: x265 with Neon optimizations, at the following commit:

https://bitbucket.org/multicoreware/x265_git/commits/4bf31dc15fb6d1f93d12ecf21fad5e695f0db5c0

Baseline experimental numbers are in the following spreadsheet:
score_x265_encoding_AWS.xlsx

Videos of choice:

Five video files with different resolutions can be selected from the Google YouTube UGC dataset at the following link.

https://console.cloud.google.com/storage/browser/ugc-dataset/original_videos/Sports

File names:

Sports_360P-02c3.mkv

Sports_480P-0623.mkv

Sports_720P-00a1.mkv

Sports_1080P-0063.mkv

Sports_2160P-0455.mkv

Quality of encoding

Any of the following presets can be selected for regression:
ultrafast superfast veryfast faster fast medium slow slower veryslow

Threads/instance choices:

--frame-threads is set to 1, and the number of concurrent encoder instances ranges from 1 to the number of vCPUs on the instance.

A sample command template:

./x265 --preset $preset --frames 50 $VIDEO --input-res $INPUTRES --fps 24 --output outfile.265 --frame-threads 1 --no-wpp --pools ',' --log-level error --csv csv_outfile.265

For 360P video:

./x265 --preset $preset --frames 50 Sports_360P-02c3.mkv --input-res 640x360 --fps 24 --output outfile.265 --frame-threads 1 --no-wpp --pools ','

For 480P video:

./x265 --preset $preset --frames 50 Sports_480P-0623.mkv --input-res 720x576 --fps 24 --output outfile.265 --frame-threads 1 --no-wpp --pools ','

For 720P video:

./x265 --preset $preset --frames 50 Sports_720P-00a1.mkv --input-res 1280x960 --fps 24 --output outfile.265 --frame-threads 1 --no-wpp --pools ','

For 1080P video:

./x265 --preset $preset --frames 50 Sports_1080P-0063.mkv --input-res 1920x1080 --fps 24 --output outfile.265 --frame-threads 1 --no-wpp --pools ','

For 2160P video:

./x265 --preset $preset --frames 50 Sports_2160P-0455.mkv --input-res 3840x2160 --fps 24 --output ${count}_outfile.265 --frame-threads 1 --no-wpp --pools ','
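
The per-resolution commands differ only in the file name and the --input-res value, so they can be generated from a small table. The following is a minimal sketch of how a multi-instance run might be scripted. The file names, resolutions and flags are copied from the commands in this post; the function names are hypothetical, and the per-instance output prefix follows the count prefix used in the 2160P command. The post does not say how jobs were assigned to cores, so this sketch leaves placement to the operating system scheduler.

```python
import shlex
import subprocess

# File names and --input-res values copied from the commands in this post.
VIDEOS = {
    "Sports_360P-02c3.mkv": "640x360",
    "Sports_480P-0623.mkv": "720x576",
    "Sports_720P-00a1.mkv": "1280x960",
    "Sports_1080P-0063.mkv": "1920x1080",
    "Sports_2160P-0455.mkv": "3840x2160",
}


def build_command(preset, video, instance, x265="./x265"):
    """Build the argument list for one encoder instance (hypothetical helper)."""
    return [
        x265, "--preset", preset, "--frames", "50", video,
        "--input-res", VIDEOS[video], "--fps", "24",
        # Give each instance its own output file so concurrent jobs do not collide.
        "--output", f"{instance}_outfile.265",
        "--frame-threads", "1", "--no-wpp", "--pools", ",",
    ]


def run_concurrently(preset, video, instances):
    """Start `instances` encoders at once and return their exit codes."""
    procs = [
        subprocess.Popen(build_command(preset, video, i))
        for i in range(1, instances + 1)
    ]
    return [proc.wait() for proc in procs]


# Show the command for one instance without launching anything.
cmd = build_command("medium", "Sports_1080P-0063.mkv", 3)
print(shlex.join(cmd))
```

Calling run_concurrently("medium", "Sports_1080P-0063.mkv", 64) would then start 64 encoders on a 64-vCPU instance, assuming the x265 binary and the input file are in the working directory.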

"Benwaggoner HEVC encoding challenge" – Stress-test

Command used:

./x265/build/aarch64-linux/x265 --input SolLevante_SDRv2_1080p24_8bit.y4m --level-idc 4.0 --preset placebo --subme 7 --sar 1 --pools +,- --ref 5 --bframes 16 -F 1 --hme --hme-search 2,3,4 --fades --frame-dup --dup-threshold 50 --tune animation --tskip --cu-lossless --rd-refine --multi-pass-opt-analysis --multi-pass-opt-distortion --keyint 120 --rc-lookahead 120 --bitrate 1000 --vbv-maxrate 4000 --vbv-bufsize 12000 --hrd --aud --colorprim bt709 --transfer bt709 --colormatrix bt709 -o SolLevante_SDR-1080p_1-4M_ultraplacebo_p3.hevc --psnr --ssim --pmode

The input is the Netflix Sol Levante 8-bit SDR 1080p .y4m file:

https://1drv.ms/u/s!AlvIQZWsyeO-k9llZI15s0x3uwd_nQ?e=PlqcNz

References

https://www.marketsandmarkets.com/Market-Reports/intelligent-video-analytics-market-778.html
https://www.polarismarketresearch.com/industry-analysis/video-analytics-market
"Benwaggoner HEVC encoding challenge" https://forum.doom9.org/showthread.php?t=175776
