The demand for high-resolution, high-definition video content is exploding. Growth in camera resolution, the size of devices (including smartphones, tablets and TVs), and in network bandwidth drives this demand. To save bandwidth and storage space, these video streams are often compressed using newer codecs like H.265. And while more efficient at compression, these codecs require significantly higher compute resources. This paper describes the work done by Videolan/FFlabs and AWS teams to optimize video encode processing for H.265 on Arm-based server platforms in the cloud.
Over the last few years, there has been a steady growth in both generation and consumption of high-resolution content. Better device cameras and higher-resolution screens for viewing content has driven this growth. Newer codecs like H.265/HEVC, VP9 or AV1 are more than 50% efficient at compressing such higher-resolution content compared to legacy codecs like H.264, as table 1 shows.
Table 1: Required bandwidth for high-resolution videos for H264 and H265
This compression efficiency comes with a much greater need for compute cycles, which can be 10x higher compared to H.264 compression. Typical processors used in the cloud like the AWS Graviton2 or Intel Xeon family often use vector-processing capabilities like Neon or SSE/AVX instructions to accelerate video processing. Over the last year, there have been significant efforts to optimize the open-source libx265 implementation of the H.265 encoder. On Arm Neoverse-based platforms like the AWS Graviton2, which supports Neon instructions, both Videolan and AWS have contributed to this effort. The result is an impressive performance uplift from 1.4x to 3x in certain scenarios, which are described in detail in the next section.
The optimized code is available at https://bitbucket.org/multicoreware/x265_git/
We benchmarked the latest snapshot of libx265 open-source codec https://bitbucket.org/multicoreware/x265_git/ on comparable Graviton and competitive instances on AWS.
We used the same video in various resolutions and encoding presets to see the impact of performance under different scenarios.
We benchmarked libx265 on C6g bare-metal before and after the Neon optimizations, to measure the uplift in performance. For fast and medium presets, we found an FPS (frames per second) uplift of ~40% across different resolutions. Whereas for slow presets, the FPS uplift was close to ~100%.
We ran multiple instances of the libx265 encoder to encode multiple jobs at the same time. And we spread these jobs evenly across multiple cores on both C6g bare-metal and C5 bare-metal instances.
For the C5 bare-metal instance, performance scaled linearly until 48 vCPUs. After 48 vCPUs the second HW thread on each physical core is used, and gains are no longer linear – in some cases, they flatten out.
On the other hand, C6g bare-metal instances show good scaling all the way up to the full 64 cores with no degradation in performance.
At the full socket level, the C6g instances performed 80% better compared to the C5 instances. And at roughly half the cost, the Arm-based instances provide an unbeatable 3x cost reduction for running H.265.
A very popular use-case for encoding in software is Video-on-Demand, where videos are pre-compressed to the highest possible ratio but without compromise to the video quality. We used the configuration from "Benwaggoner HEVC encoding challenge" with Netflix input file Sol Levante's 8-bit SDR 1080p.y4m
We observe that the Graviton2-based instances provide an exceptional performance advantage in this scenario. They compress the entire video in 1/4th of the time taken by the comparable x86-based instances which translates into 8x lower costs.
With the growth in high-resolution video content, use of higher compression codecs like H.265 for video-streaming applications in the cloud becomes essential. There have been several efforts to optimize the libx265 codec for Arm Neoverse platforms. Encoding high-resolution videos using these latest optimizations provides up to 2x performance uplift over previous implementations. And at a system-level, AWS Graviton2 bare-metal instances provide both better scaling and 80% higher overall performance at about a third of the cost compared to other similar bare-metal instances. Any businesses with a significant monthly video encoding bill should check out AWS Graviton2.
Check out AWS Graviton2
Instances: AWS C5 and C6g Bare Metal
OS: Ubuntu 20.04
x265 branch, x265 with Neon optimizations.
Baseline experimental numbers in the following spreadsheet. score_x265_encoding_AWS.xlsx
Videos of choice:
We can select five video files from Google YouTube UGC dataset with different resolutions from the following link.
Quality of encoding
We can select the following preset for regressionultrafast superfast veryfast faster fast medium slow slower veryslow
frame-threads being 1 and instance ranging from 1 to the number of vCPUs on the instance.
A sample command template:
./x265 --preset $preset --frames 50 $VIDEO --input-res $INPUTRES --fps 24 --output outfile.265 --frame-threads 1 --no-wpp --pools ',' --log-level error --csv csv_outfile.265
For 360P video:
./x265 --preset $preset --frames 50 Sports_360P-02c3.mkv input-res 640x360 fps 24 --output outfile.265 --frame-threads 1 --no-wpp --pools ','
For 480P video:
./x265 --preset $preset --frames 50 Sports_480P-0623.mkv input-res 720x576 fps 24 --output outfile.265 --frame-threads 1 --no-wpp --pools ','
For 720P video:
./x265 --preset $preset --frames 50 Sports_720P-00a1.mkv input-res 1280x960 fps 24 --output outfile.265 --frame-threads 1 --no-wpp --pools ','
For 1080P video:
./x265 --preset $preset --frames 50 Sports_1080P-0063.mkv input-res 1920x1080 fps 24 --output outfile.265 --frame-threads 1 --no-wpp --pools ','
For 2160P video:
./x265 --preset $preset --frames 50 Sports_2160P-0455.mkv input-res 3840x2160 fps 24 --output $count_outfile.265 --frame-threads 1 --no-wpp --pools ','
"Benwaggoner HEVC encoding challenge" – Stress-test
./x265/build/aarch64-linux/x265 --input SolLevante_SDRv2_1080p24_8bit.y4m --level-idc 4.0 --preset placebo --subme 7 --sar 1 --pools +,- --ref 5 --bframes 16 -F 1 --hme --hme-search 2,3,4 --fades --frame-dup --dup-threshold 50 --tune animation --tskip --cu-lossless --rd-refine --multi-pass-opt-analysis --multi-pass-opt-distortion --keyint 120 --rc-lookahead 120 --bitrate 1000 --vbv-maxrate 4000 --vbv-bufsize 12000 --hrd --aud --colorprim bt709 --transfer bt709 --colormatrix bt709 -o SolLevante_SDR-1080p_1-4M_ultraplacebo_p3.hevc --psnr --ssim --pmode
with Netflix input file Sol Levante's 8-bit SDR 1080p .y4m