Arm Community
Infrastructure Solutions blog
Reduce H.265 High-Res Encoding Costs by over 80% with AWS Graviton2

Yichen Jia
April 26, 2022

The demand for high-resolution, high-definition video content is exploding. It is driven by higher camera resolutions, larger screens on devices such as smartphones, tablets and TVs, and growing network bandwidth. To save bandwidth and storage space, these video streams are often compressed with newer codecs like H.265. While these codecs compress more efficiently, they require significantly more compute. This post describes the work done by the VideoLAN/FFlabs and AWS teams to optimize H.265 video encoding on Arm-based server platforms in the cloud.

Background

Over the last few years, there has been steady growth in both the generation and the consumption of high-resolution content, driven by better device cameras and higher-resolution screens. Newer codecs like H.265/HEVC, VP9 and AV1 compress such content far more efficiently than legacy codecs like H.264: as Table 1 shows, H.265 needs roughly half the bandwidth of H.264 at each resolution.

Resolution         H.264 bandwidth   H.265 bandwidth
1280×720 (HD)      3 Mbps            1.5 Mbps
1920×1080 (FHD)    6 Mbps            3 Mbps
3840×2160 (UHD)    25 Mbps           12 Mbps
4096×2160 (4K)     32 Mbps           15 Mbps

Table 1: Required bandwidth for high-resolution video with H.264 and H.265
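To make the savings in Table 1 concrete, here is a small POSIX shell sketch that computes the bandwidth reduction per resolution. It only does arithmetic on the table's numbers; the function name `saving_pct` is introduced here for illustration.

```sh
#!/bin/sh
# Percentage of bandwidth that H.265 saves relative to H.264,
# computed from the Table 1 bitrates (Mbps). Arithmetic only.
saving_pct() {
  # $1 = H.264 bitrate, $2 = H.265 bitrate
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a - b) / a * 100 }'
}

# Resolution, H.264 Mbps, H.265 Mbps (values copied from Table 1)
while read -r res h264 h265; do
  printf '%s: %s%% less bandwidth with H.265\n' "$res" "$(saving_pct "$h264" "$h265")"
done <<EOF
1280x720 3 1.5
1920x1080 6 3
3840x2160 25 12
4096x2160 32 15
EOF
```

It prints 50.0% for HD and FHD, 52.0% for UHD and 53.1% for 4K, so the saving grows slightly with resolution.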

This compression efficiency comes at the cost of many more compute cycles, up to 10x more than H.264 encoding. Cloud processors such as AWS Graviton2 and the Intel Xeon family accelerate video processing with vector instructions: Neon on Arm, SSE/AVX on x86. Over the last year, there have been significant efforts to optimize the open-source libx265 implementation of the H.265 encoder. Both VideoLAN and AWS have contributed Neon optimizations for Arm Neoverse-based platforms such as AWS Graviton2. The result is a performance uplift of 1.4x to 3x in certain scenarios, described in detail in the next section.

The optimized code is available at https://bitbucket.org/multicoreware/x265_git/

Performance results

We benchmarked the latest snapshot of the open-source libx265 encoder (https://bitbucket.org/multicoreware/x265_git/) on comparable Graviton2-based and x86-based bare-metal instances on AWS:

  • c6g.metal – 64 Arm Neoverse N1 cores (64 vCPU)
  • c5.metal – 48 Intel Xeon cores, 96 threads (96 vCPU)

We encoded the same video at various resolutions and with various encoding presets to see how performance changes across scenarios.

AWS Graviton2 performance uplift from Arm Neon optimizations

We benchmarked libx265 on the C6g bare-metal instance before and after the Neon optimizations to measure the performance uplift. For the fast and medium presets, we measured a frames-per-second (FPS) uplift of about 40% across resolutions, whereas for the slow presets the uplift was close to 100%.

[Figures: x265 performance and speedup on C6g for the ultrafast, medium and veryslow presets]
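The uplift percentages above and the 1.4x figure in the introduction express the same quantity in two ways. This sketch converts a before/after FPS pair into both forms; the FPS values in the usage lines are hypothetical, not measured results.

```sh
#!/bin/sh
# Convert before/after FPS into a speedup factor and a percentage uplift.
speedup() {
  # $1 = FPS before the Neon optimizations, $2 = FPS after
  awk -v b="$1" -v a="$2" 'BEGIN { printf "%.2fx (+%.0f%%)\n", a / b, (a / b - 1) * 100 }'
}

# Hypothetical FPS values, for illustration only
speedup 20 28   # a ~40% uplift
speedup 20 40   # a ~100% uplift
```

These print `1.40x (+40%)` and `2.00x (+100%)`.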

AWS Graviton2 to Intel Xeon performance comparison

We ran multiple libx265 encoder instances concurrently, one job per instance, and spread the jobs evenly across the cores of both the C6g and C5 bare-metal instances.

On the C5 bare-metal instance, performance scaled linearly up to 48 vCPUs. Beyond 48 vCPUs, jobs start using the second hardware thread on each physical core, and gains are no longer linear; in some cases, they flatten out.

In contrast, the C6g bare-metal instance scales well all the way up to its full 64 cores, with no degradation in performance.

[Figure: x265 socket scaling, C5 vs C6g]
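The multi-instance setup described above can be sketched as a POSIX shell function that starts N single-frame-thread encodes in the background, pinning instance i to logical CPU i with `taskset`. The pinning scheme, the placeholder input `input.yuv` and resolution, and the function name `launch_jobs` are assumptions for illustration, not the exact harness used for these results.

```sh
#!/bin/sh
# Run N independent x265 encodes concurrently, pinning instance i to
# logical CPU i with taskset so the jobs are spread evenly across cores.
# Assumes CPU IDs 0..N-1 exist. VIDEO and INPUTRES are placeholders.
# DRY_RUN=1 (the default) only prints the commands; DRY_RUN=0 runs them.
PRESET=${PRESET:-medium}
VIDEO=${VIDEO:-input.yuv}
INPUTRES=${INPUTRES:-1920x1080}
DRY_RUN=${DRY_RUN:-1}

launch_jobs() {
  n=$1   # number of concurrent encoder instances
  i=0
  while [ "$i" -lt "$n" ]; do
    set -- taskset -c "$i" ./x265 --preset "$PRESET" --frames 50 "$VIDEO" \
      --input-res "$INPUTRES" --fps 24 --output "${i}_outfile.265" \
      --frame-threads 1 --no-wpp --pools ',' --log-level error
    if [ "$DRY_RUN" = 1 ]; then
      echo "$*"
    else
      "$@" &   # start this instance in the background
    fi
    i=$((i + 1))
  done
  # Block until every background encode has finished
  [ "$DRY_RUN" = 1 ] || wait
}

launch_jobs 4
```

Which physical core a logical CPU ID maps to depends on the machine; on many x86 Linux systems the first half of the IDs are distinct physical cores and the second half are their SMT siblings, which is worth checking with `lscpu -e` before interpreting the scaling curve.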

At full-socket load, the C6g instance delivered 80% more throughput than the C5 instance. Because the C6g instance costs roughly half as much per hour, this amounts to about a 3x reduction in the cost of H.265 encoding.
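The cost figure combines two ratios: throughput and hourly price. A sketch of the arithmetic follows; since the text says only "roughly half the cost", the price ratio is an input here and is bracketed between 0.5 and 0.6 rather than taken from a specific price list.

```sh
#!/bin/sh
# Relative cost per encoded frame of instance A versus instance B:
#   cost ratio = (hourly price A / hourly price B) / (throughput A / throughput B)
# A value of 0.33 means A does the same work for about a third of the cost.
cost_ratio() {
  # $1 = throughput ratio A/B, $2 = hourly price ratio A/B
  awk -v t="$1" -v p="$2" 'BEGIN { printf "%.2f\n", p / t }'
}

# C6g vs C5: 80% more throughput (1.8x) at an assumed 0.5x to 0.6x hourly price
cost_ratio 1.8 0.5
cost_ratio 1.8 0.6
```

These print 0.28 and 0.33, i.e. roughly 3.6x and 3x lower cost per frame, consistent with the approximately 3x figure above.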

Stress-test: Video-on-Demand at highest compression

A popular use case for software encoding is Video-on-Demand (VoD), where videos are compressed ahead of time to the highest possible ratio without compromising video quality. We used the configuration from the "Benwaggoner HEVC encoding challenge" [3], with the 8-bit SDR 1080p .y4m version of Netflix's Sol Levante as the input file.

In this scenario, the Graviton2-based instance has a large performance advantage: it compresses the entire video in a quarter of the time taken by the comparable x86-based instance. Combined with its roughly half hourly price, this translates into 8x lower cost.

[Figure: Cost and time to encode, C6g vs C5]
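For a single VoD job, cost is the hourly price multiplied by the wall-clock time, so the 8x figure is the product of the time ratio (1/4) and the hourly price ratio (roughly 1/2, as stated for the socket-level comparison). A minimal sketch of that arithmetic:

```sh
#!/bin/sh
# Cost of one encode job = hourly instance price x wall-clock hours, so the
# cost ratio of two instances is (time ratio) x (hourly price ratio).
job_cost_ratio() {
  # $1 = encode-time ratio A/B, $2 = hourly price ratio A/B
  awk -v t="$1" -v p="$2" 'BEGIN { r = t * p; printf "%.3f (%.0fx lower cost)\n", r, 1 / r }'
}

# C6g vs C5: a quarter of the encode time at roughly half the hourly price
job_cost_ratio 0.25 0.5
```

This prints `0.125 (8x lower cost)`.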

Conclusion

With the growth in high-resolution video content, using higher-compression codecs like H.265 for video streaming in the cloud becomes essential. There have been several efforts to optimize the libx265 codec for Arm Neoverse platforms. With these latest optimizations, encoding high-resolution video is up to 2x faster than with previous implementations. At the system level, AWS Graviton2 bare-metal instances provide both better scaling and 80% higher overall performance, at about a third of the encoding cost of comparable bare-metal instances. Any business with a significant monthly video encoding bill should consider AWS Graviton2.

Benchmarking configuration:

Instances: AWS C5 and C6g Bare Metal

OS: Ubuntu 20.04

GCC: 9.3

x265 version: the branch with the Neon optimizations, at commit:

https://bitbucket.org/multicoreware/x265_git/commits/4bf31dc15fb6d1f93d12ecf21fad5e695f0db5c0


Baseline experimental numbers are in the following spreadsheet:
score_x265_encoding_AWS.xlsx

Videos of choice:

We selected five video files with different resolutions from Google's YouTube UGC dataset, available at the following link:

https://console.cloud.google.com/storage/browser/ugc-dataset/original_videos/Sports

File names:

Sports_360P-02c3.mkv

Sports_480P-0623.mkv

Sports_720P-00a1.mkv

Sports_1080P-0063.mkv

Sports_2160P-0455.mkv
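One way to fetch these clips is with `gsutil`, assuming the console link above corresponds to the bucket path `gs://ugc-dataset/original_videos/Sports` and that it is publicly readable. The x265 CLI documentation lists only raw YUV and Y4M as input formats, so this sketch also converts each clip to Y4M with `ffmpeg`. The 8-bit 4:2:0 pixel format is an assumption; with Y4M input, x265 reads the resolution and frame rate from the file header.

```sh
#!/bin/sh
# Fetch the five Sports clips and convert each to Y4M for the x265 CLI.
# Assumes gsutil and ffmpeg are installed and the bucket is public.
# DRY_RUN=1 (the default) only prints the commands; DRY_RUN=0 runs them.
DRY_RUN=${DRY_RUN:-1}
BUCKET=gs://ugc-dataset/original_videos/Sports

run() {
  if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

fetch_all() {
  for f in Sports_360P-02c3 Sports_480P-0623 Sports_720P-00a1 \
           Sports_1080P-0063 Sports_2160P-0455; do
    run gsutil cp "$BUCKET/$f.mkv" .
    # 8-bit 4:2:0 output is an assumption; adjust -pix_fmt for other depths
    run ffmpeg -y -i "$f.mkv" -pix_fmt yuv420p "$f.y4m"
  done
}

fetch_all
```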

Quality of encoding

We can select any of the following presets for the regression runs:
ultrafast superfast veryfast faster fast medium slow slower veryslow

Threads/instance choices:

Each encoder instance runs with --frame-threads 1, and the number of concurrent instances ranges from 1 to the number of vCPUs on the machine.
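As the instance count is swept from 1 to the vCPU count, each point on a scaling curve can be summarized as aggregate frames per second. A minimal sketch, assuming all instances start together and the wall time runs from the first start to the last finish; the timing in the usage line is hypothetical, not a measured result.

```sh
#!/bin/sh
# Aggregate throughput of a multi-instance run. Every instance encodes the
# same number of frames, so total FPS = instances * frames / wall-clock time.
agg_fps() {
  # $1 = concurrent instances, $2 = frames per instance, $3 = wall time (s)
  awk -v n="$1" -v f="$2" -v t="$3" 'BEGIN { printf "%.1f\n", n * f / t }'
}

# Hypothetical timing: 64 instances x 50 frames (--frames 50) in 10 s
agg_fps 64 50 10
```

This prints `320.0`. Linear scaling shows up as aggregate FPS growing in proportion to the instance count.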

A sample command template:

./x265 --preset $preset --frames 50 $VIDEO --input-res $INPUTRES --fps 24 --output outfile.265 --frame-threads 1 --no-wpp --pools ',' --log-level error --csv csv_outfile.265
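The template can be filled in once per preset for a single clip. In this sketch, `input.yuv`, the resolution, and the per-preset output and CSV file names are illustrative placeholders, and `preset_sweep` is a hypothetical helper name; `set --` builds each argument list so that file names are passed safely.

```sh
#!/bin/sh
# Fill in the command template once per preset, from ultrafast to veryslow.
# VIDEO and INPUTRES are placeholders; output and CSV names are illustrative.
# DRY_RUN=1 (the default) only prints the commands; DRY_RUN=0 runs them.
VIDEO=${VIDEO:-input.yuv}
INPUTRES=${INPUTRES:-1920x1080}
DRY_RUN=${DRY_RUN:-1}

preset_sweep() {
  for preset in ultrafast superfast veryfast faster fast medium slow slower veryslow; do
    set -- ./x265 --preset "$preset" --frames 50 "$VIDEO" --input-res "$INPUTRES" \
      --fps 24 --output "outfile_$preset.265" --frame-threads 1 --no-wpp \
      --pools ',' --log-level error --csv "csv_$preset.csv"
    if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
  done
}

preset_sweep
```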

For 360P video:

./x265 --preset $preset --frames 50 Sports_360P-02c3.mkv --input-res 640x360 --fps 24 --output outfile.265 --frame-threads 1 --no-wpp --pools ','

For 480P video:

./x265 --preset $preset --frames 50 Sports_480P-0623.mkv --input-res 720x576 --fps 24 --output outfile.265 --frame-threads 1 --no-wpp --pools ','

For 720P video:

./x265 --preset $preset --frames 50 Sports_720P-00a1.mkv --input-res 1280x960 --fps 24 --output outfile.265 --frame-threads 1 --no-wpp --pools ','

For 1080P video:

./x265 --preset $preset --frames 50 Sports_1080P-0063.mkv --input-res 1920x1080 --fps 24 --output outfile.265 --frame-threads 1 --no-wpp --pools ','

For 2160P video:

./x265 --preset $preset --frames 50 Sports_2160P-0455.mkv --input-res 3840x2160 --fps 24 --output ${count}_outfile.265 --frame-threads 1 --no-wpp --pools ','

"Benwaggoner HEVC encoding challenge" – Stress-test

Command used:

./x265/build/aarch64-linux/x265 --input SolLevante_SDRv2_1080p24_8bit.y4m --level-idc 4.0 --preset placebo --subme 7 --sar 1 --pools +,- --ref 5 --bframes 16 -F 1 --hme --hme-search 2,3,4 --fades --frame-dup --dup-threshold 50 --tune animation --tskip --cu-lossless --rd-refine --multi-pass-opt-analysis --multi-pass-opt-distortion --keyint 120 --rc-lookahead 120 --bitrate 1000 --vbv-maxrate 4000 --vbv-bufsize 12000 --hrd --aud --colorprim bt709 --transfer bt709 --colormatrix bt709 -o SolLevante_SDR-1080p_1-4M_ultraplacebo_p3.hevc --psnr --ssim --pmode 

Input file: the 8-bit SDR 1080p .y4m version of Netflix's Sol Levante, available at:

https://1drv.ms/u/s!AlvIQZWsyeO-k9llZI15s0x3uwd_nQ?e=PlqcNz

References

  1. https://www.marketsandmarkets.com/Market-Reports/intelligent-video-analytics-market-778.html
  2. https://www.polarismarketresearch.com/industry-analysis/video-analytics-market
  3. "Benwaggoner HEVC encoding challenge" https://forum.doom9.org/showthread.php?t=175776