The growth of high-definition video content for playback on larger and higher-resolution devices has driven the need for more efficient video codecs like H.265. And while H.265 is twice as bandwidth efficient as the older H.264 codec, it consumes significantly more compute resources to deliver that efficiency. Controlling cost (e.g., bandwidth usage) is now the number one challenge cited by video developers [1], making H.265 attractive. But if lower bandwidth costs are replaced by higher compute and power costs, video developers can find themselves running in place. What they need is a solution that delivers H.265 efficiency without the compute and power tax. This blog makes the case that Arm Neoverse-based Ampere Altra Max servers are the solution video developers need for encoding H.265 video streams.
Over the last few years, thanks to better cameras and larger, higher-resolution devices, there has been steady growth in both the generation and consumption of high-resolution video content. More modern codecs like H.265/HEVC, VP9, or AV1 are more than 50% more efficient at compressing higher-resolution content than legacy codecs like H.264. Recent market research indicates that this growth is translating into a significant increase in the usage of these codecs, with H.265 leading the pack.
Figure 1: Bitmovin 2021 report on video codecs being used in production (2020 vs. 2021)
The demand for high-resolution video content is also being driven by the popularity of streaming services like Netflix and Amazon Prime, and their efforts to attract and retain customers only increase this demand. It is no surprise, then, that video upload and ingestion (a function of bandwidth requirements) and video transcoding and processing (a function of compute requirements) represent the largest shares of the video processing platform market [2].
Figure 2: Video Processing Platform Market Share, by Application, 2020
The improved compression of H.265 comes with the tradeoff of greater compute complexity, which can be an order of magnitude (10x) higher than with H.264. And while the use of cloud-based encoding is growing, most video encoding remains an on-prem task [1]. So the added compute requirement (a CapEx cost) and power usage (an OpEx cost) of H.265 encoding represent a challenge for most video developers. It is therefore important that encoding be done on servers that are both more performant and more power efficient.
Technical media have verified the performance and power efficiency benefits of Ampere Altra Max over legacy architectures on general-purpose benchmarks such as SPECrate® 2017 Integer [3]. With 128 Arm Neoverse N1 cores at 3.0 GHz, Ampere Altra Max outperforms Intel Xeon ‘Ice Lake’ and AMD EPYC ‘Milan’ CPUs, both of which are rated at a higher power consumption (TDP). In this blog, we demonstrate that these performance and power efficiency benefits of Ampere Altra Max extend to video encoding applications like H.265 as well.
To show this, we encoded H.265 video and measured actual performance and power consumption with the system fully loaded. Our builds include recent optimization efforts on the open-source libx265 encoder that use the Neon SIMD engines on 64-bit Arm architectures. These optimizations have resulted in a significant performance uplift of 1.5x to 2.2x [4].
We benchmarked the latest snapshot of the open-source libx265 encoder (https://bitbucket.org/multicoreware/x265_git/) on comparable Arm and x86 based servers. The x265 version on all the systems was 3.5+20-17839cc0d. System details for the Ampere Altra Max servers based on Arm Neoverse N1 cores and for the x86 systems based on the Intel ‘Ice Lake’ and AMD ‘Milan’ architectures are shown in the Configuration section, along with the input videos. We used various resolutions and encoding presets to see the impact on performance under different scenarios.
To find the full socket performance, we started as many H.265 encoding tasks as there are virtual cores in the system and measured the cumulative frames per second (FPS). We ran 128 tasks on the Altra Max and AMD EPYC 7763 CPUs and 80 tasks on the Intel Xeon 8380 CPU. We observed that the full socket performance of Altra Max is 10% to 35% better than that of the AMD EPYC 7763 and more than 2x better than that of the Intel Xeon 8380 across various video resolutions and encoding presets.
Figure 3: x265 Relative performance between Ampere Altra Max, AMD EPYC, and Intel Xeon
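The full-socket measurement described above can be approximated with a short script. The following is a minimal sketch, not the harness used for the published numbers: it assumes an `x265` binary on the PATH, a hypothetical Y4M input file, and that each encode prints a summary line of the form `encoded N frames in Xs (Y fps)` to stderr. Any thread-pool or core-pinning settings used in the original runs are not stated in this blog and are not reproduced here.

```python
import os
import re
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Assumed shape of the x265 end-of-encode summary line, for example:
#   encoded 600 frames in 45.67s (13.14 fps), 1234.56 kb/s, Avg QP:32.10
SUMMARY_RE = re.compile(
    r"encoded\s+(\d+)\s+frames\s+in\s+([\d.]+)s\s+\(([\d.]+)\s+fps\)"
)


def parse_fps(log_text):
    """Extract the frames-per-second figure from one encoder log."""
    match = SUMMARY_RE.search(log_text)
    if match is None:
        raise ValueError("no x265 summary line found in log")
    return float(match.group(3))


def cumulative_fps(logs):
    """Sum the per-task FPS values, as in the full-socket metric."""
    return sum(parse_fps(log) for log in logs)


def run_one_encode(input_path, preset):
    """Run a single encode, discard the bitstream, return the stderr log."""
    cmd = ["x265", "--input", input_path, "--preset", preset,
           "--output", os.devnull]
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                            stderr=subprocess.PIPE, text=True, check=True)
    return result.stderr


def run_full_socket(input_path, preset, num_tasks):
    """Launch num_tasks encodes concurrently and return their summed FPS."""
    # One worker thread per task so every encoder process runs at once
    # and each thread drains its own process's stderr pipe.
    with ThreadPoolExecutor(max_workers=num_tasks) as pool:
        futures = [pool.submit(run_one_encode, input_path, preset)
                   for _ in range(num_tasks)]
        logs = [future.result() for future in futures]
    return cumulative_fps(logs)
```

A call such as `run_full_socket("input_1080p.y4m", "medium", 128)` (file name hypothetical) would mirror the 128-task case. Summing per-task FPS is only meaningful when all tasks overlap for essentially their whole run, which is why the sketch starts them together on identical input.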
It is interesting to note the difference in performance scaling between the x86 CPUs, with their SMT-based architecture, and the single-threaded core architecture of Altra Max. With Altra Max, performance scales linearly with the number of encoding tasks in the system. On AMD EPYC 7763 and Intel Xeon 8380, performance scaling is non-linear, with performance degrading significantly once virtual cores are used.
Figure 4: x265 Performance scaling by number of jobs: Ampere Altra Max
Figure 5: x265 Performance scaling by number of jobs: AMD EPYC 7763
Figure 6: x265 Performance scaling by number of jobs: Intel Xeon 8380
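One way to quantify the scaling behavior shown in these figures is to compare the measured cumulative FPS at each job count against ideal linear scaling from the smallest job count. The sketch below is an illustration only; the function name is hypothetical and the numbers in the usage lines are placeholders, not measurements from this study.

```python
def scaling_efficiency(fps_by_jobs):
    """Return measured FPS divided by ideal linear FPS for each job count.

    fps_by_jobs maps a number of concurrent encode jobs to the cumulative
    FPS measured at that job count. The smallest job count is the baseline.
    A value of 1.0 means perfectly linear scaling; lower values mean each
    added job contributed less than the baseline per-job throughput.
    """
    if not fps_by_jobs:
        raise ValueError("need at least one measurement")
    base_jobs = min(fps_by_jobs)
    fps_per_job = fps_by_jobs[base_jobs] / base_jobs
    return {jobs: fps / (jobs * fps_per_job)
            for jobs, fps in sorted(fps_by_jobs.items())}


# Placeholder values chosen only to show the two shapes of curve.
linear_like = scaling_efficiency({1: 10.0, 2: 20.0, 4: 40.0})
smt_like = scaling_efficiency({1: 10.0, 2: 20.0, 4: 30.0})
print(linear_like)  # every entry is 1.0
print(smt_like[4])  # 0.75
```

With real data, a system whose efficiency stays near 1.0 up to the full job count corresponds to the linear curve, while a drop once the job count exceeds the number of physical cores corresponds to the SMT behavior described above.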
The power efficiency of a platform is measured by the number of frames it encodes within a certain power budget. To measure this, we fully loaded a single socket on each platform with the maximum number of H.265 encode tasks. We then measured the power consumed at the socket level and calculated FPS per Watt.
We found that on average, across different video resolutions and encoding presets, the Altra Max was 40-70% more efficient than AMD EPYC 7763, and up to 3x more efficient than Intel Xeon 8380.
Figure 7: x265 Relative performance per Watt between Ampere Altra Max, AMD EPYC, and Intel Xeon
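The FPS-per-Watt metric and the relative comparison between platforms reduce to two divisions. The sketch below spells them out; the function names are hypothetical, the socket power reading is assumed to come from whatever measurement tool is available on the platform (the blog does not say which was used), and the inputs in the usage lines are placeholders rather than measured results.

```python
def fps_per_watt(cumulative_fps, socket_power_watts):
    """Frames encoded per second for each Watt drawn by the socket."""
    if socket_power_watts <= 0:
        raise ValueError("socket power must be positive")
    return cumulative_fps / socket_power_watts


def relative_efficiency(candidate_fps_per_watt, baseline_fps_per_watt):
    """How many times more efficient the candidate is than the baseline."""
    if baseline_fps_per_watt <= 0:
        raise ValueError("baseline efficiency must be positive")
    return candidate_fps_per_watt / baseline_fps_per_watt


# Placeholder inputs, not measurements from this study.
candidate = fps_per_watt(cumulative_fps=500.0, socket_power_watts=250.0)
baseline = fps_per_watt(cumulative_fps=400.0, socket_power_watts=400.0)
print(relative_efficiency(candidate, baseline))  # 2.0
```

Note that the denominator is measured power under load, not the rated TDP, so two CPUs with similar TDP ratings can still differ widely on this metric.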
With the growth in high-resolution streaming, there is a need to use higher-compression codecs like H.265 for video-streaming applications in the cloud. This compression comes at a significantly higher compute cost and higher power consumption. At the system level, Arm Neoverse-based Ampere Altra Max servers provide better scaling and up to 2x higher performance, while offering up to 3x higher workload power efficiency, compared to Intel ‘Ice Lake’ server platforms. Altra Max servers provide up to 35% higher performance than AMD ‘Milan’ servers, and up to 70% higher workload power efficiency. Recent x265 optimizations for the Arm architecture have created a new era in power-efficient encoding with outstanding performance, and we encourage the reader to evaluate Ampere Altra and Altra Max systems for x265 video encoding.
Finally, we must recognize that improving the efficiency of computing isn't only a video encoding challenge; it's a general processing challenge. New architectures like Arm Neoverse and cloud-first CPU designs like Ampere Altra Max can help reduce the carbon emissions impact of computing both on-prem and in the cloud. For more on the sustainability benefits of Neoverse and Ampere Altra Max, we encourage you to read our Earth Day 2022 blog.
System configuration used in comparison

| Processor | Ampere Altra Max M128-30 | AMD EPYC 7763 (Milan) | Intel Xeon 8380 (Ice Lake) |
| --- | --- | --- | --- |
| Cores | 128 | 64 cores with 128 threads | 40 cores with 80 threads |
| TDP | 250W | 270W | 280W |
| OS | CentOS Linux 8 (Core) | CentOS Linux 8 | CentOS Linux 8 |
| Kernel | 4.18.0-80.11.2.el8.20210225+amp.opt.aarch64 | 4.18.0-305.3.1.el8.x86_64 | 4.18.0-305.3.1.el8.x86_64 |
| x265 version | 3.5+20-17839cc0d | 3.5+20-17839cc0d | 3.5+20-17839cc0d |
| OS compiler | 8.5.0 | 8.5.0 | 8.5.0 |
Input video files: