This blog presents a case study using SIMD Everywhere (SIMDe) to automatically port software using x86 SSE and AVX SIMD intrinsics to Arm Neon.
This case study was originally described in the academic paper "Open and optimized VVC Implementations on ARM Architectures" by Benjamin Bross, Christian Lehmann, Gabriel Hege, and Adam Wieckowski from the Video Communication and Applications Department of Fraunhofer HHI in Berlin (2023). We would like to thank them for their insights and open-source contributions to the developer community.
The software ported in the study is the Fraunhofer Versatile Video Encoder (VVenC) and Versatile Video Decoder (VVdeC). The paper describes how SIMDe was used to convert Intel x86 intrinsics to Arm Neon instructions and analyzes the performance of the resulting Arm port compared with x86 machines.
This blog summarizes the findings of the paper and provides additional details to help you replicate the results.
There are many ways to write code, each with its own advantages and disadvantages. But in the end, all code runs as machine code, using the instructions available in the target device’s ISA.
Writing in a high-level language such as C or C++ is easiest for the programmer, because the compiler turns your program into machine code instructions. However, the programmer relinquishes control over the precise instructions that the compiler chooses.
Writing low-level code such as assembly is much more complicated for the programmer but provides a high level of control over the instructions used. Writing directly in assembly can also result in significant portability and engineering complexity costs.
Intrinsics are function calls that the compiler replaces with specific assembly instructions. Intrinsics provide a way to write code that is easier to maintain than assembly, while keeping control of which instructions are generated. This gives you direct, low-level access to the exact instructions you want, all from high-level C or C++ code.
However, because intrinsics are tightly bound to the underlying ISA, they are often not portable across architectures. For example, Intel x86 and Arm provide different SIMD intrinsics, which reflect the different Instruction Set Architectures (ISAs).
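As a small illustration of this, the sketch below adds four 32-bit integers using x86 SSE2 intrinsics, Arm Neon intrinsics, or a plain C loop, depending on what the compiler targets. The function name add4_i32 is hypothetical and chosen only for this example. The intrinsics themselves (_mm_add_epi32, vaddq_s32, and the matching loads and stores) are the standard ones from emmintrin.h and arm_neon.h.

```c
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* x86 SSE2 intrinsics */

/* x86 version: one 128-bit SSE2 add covers all four integers. */
void add4_i32(const int32_t *a, const int32_t *b, int32_t *out) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}

#elif defined(__ARM_NEON)
#include <arm_neon.h>    /* Arm Neon intrinsics */

/* Arm version: same operation, but different types and function names. */
void add4_i32(const int32_t *a, const int32_t *b, int32_t *out) {
    int32x4_t va = vld1q_s32(a);
    int32x4_t vb = vld1q_s32(b);
    vst1q_s32(out, vaddq_s32(va, vb));
}

#else

/* Scalar fallback for targets with neither instruction set. */
void add4_i32(const int32_t *a, const int32_t *b, int32_t *out) {
    for (int i = 0; i < 4; ++i) {
        out[i] = a[i] + b[i];
    }
}

#endif
```

The three bodies compute the same result, yet none of the vector types or function names carry over from one architecture to the other. In a large code base, every such call site has to be handled when porting.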
The software in this case study makes extensive use of Intel x86 intrinsics, which must be ported before the code can run on an Arm-based device.
Single Instruction, Multiple Data (SIMD) architecture enables parallel processing of data. A SIMD instruction operates on multiple data values simultaneously. This data is often arranged in arrays or vectors. The benefit of SIMD is that tasks processing large amounts of data execute faster than they would using scalar instructions.
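To make the idea concrete, here is a plain C model of the two access patterns. It is only a sketch: it uses no SIMD instructions itself, and the function names and the lane count of four are assumptions picked for illustration (four 32-bit floats fit in one 128-bit vector register). The inner loop of add_by_lanes marks the work that a single SIMD add instruction would perform in one step.

```c
#include <stddef.h>

/* Scalar pattern: one addition per loop iteration. */
void add_scalar(const float *a, const float *b, float *out, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

/* SIMD-shaped pattern: the data is walked in groups of four "lanes".
   With real SIMD, the inner lane loop would be one vector instruction. */
void add_by_lanes(const float *a, const float *b, float *out, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        for (size_t lane = 0; lane < 4; ++lane) {
            out[i + lane] = a[i + lane] + b[i + lane];
        }
    }
    /* Tail: elements left over when n is not a multiple of four. */
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

Both functions produce identical output. The difference is that the second one exposes the fixed-width grouping that vector hardware works on, including the leftover tail that SIMD code usually has to handle separately.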
SIMD code is often written using intrinsics. Intel x86 provides SSE and AVX intrinsics and Arm provides Neon intrinsics, reflecting their different ISAs.
Each SIMD instruction set only executes on the architecture it was designed for. This means that Intel SIMD instructions do not work on Arm-based systems, and vice versa, which causes a problem when porting code from one platform to another. Porting code from Intel SIMD to Arm Neon by hand can be very time-consuming and labor-intensive.
SIMD Everywhere (SIMDe) is an open-source library that provides portable implementations of SIMD intrinsics on hardware that does not natively support them. For example, SIMDe lets you use Intel SSE functions on Arm-based hardware. SIMDe is header-only, so it does not need to be compiled separately: you point the compiler at the location of the headers and add a #include to your source code. SIMDe then maps the original x86 SIMD intrinsics to Neon equivalents at compile time, so the code builds on Arm without a manual rewrite.
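A minimal sketch of what this looks like in source code follows. It assumes the SIMDe headers are on the include path (for example through a -I option pointing at a SIMDe checkout) and uses SIMDe's SIMDE_ENABLE_NATIVE_ALIASES macro so that existing _mm_* names keep working unchanged. The function name add4_i32, the USE_SIMDE macro, and the __has_include guard are assumptions of this sketch only. The guard is there so the file still builds, using a scalar loop, on a machine where SIMDe is not installed.

```c
#include <stdint.h>

/* Pull in SIMDe only if its headers can be found on the include path. */
#if defined(__has_include)
#  if __has_include("simde/x86/sse2.h")
#    define SIMDE_ENABLE_NATIVE_ALIASES  /* keep the original _mm_* names */
#    include "simde/x86/sse2.h"          /* used in place of <emmintrin.h> */
#    define USE_SIMDE 1                  /* hypothetical flag for this sketch */
#  endif
#endif

/* Adds four 32-bit integers. With SIMDe, the body is ordinary SSE2 code
   that also compiles on Arm, where SIMDe implements it using Neon. */
void add4_i32(const int32_t *a, const int32_t *b, int32_t *out) {
#if defined(USE_SIMDE)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
#else
    /* Fallback when SIMDe is not available: plain scalar loop. */
    for (int i = 0; i < 4; ++i) {
        out[i] = a[i] + b[i];
    }
#endif
}
```

In an existing x86 code base, the change is typically limited to the include line and the define. The intrinsic calls themselves stay as they are.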
The main advantage of using SIMDe is that it usually offers a much better ratio of performance to effort and time than manual porting, while still allowing manual optimization where needed.
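However, as with any automatic approach, there can be drawbacks. In particular, the automatically translated code is not always as efficient as hand-written Neon code.
For more information, see the SIMDe GitHub page.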
The international video coding standard Versatile Video Coding (VVC), also known as H.266, ISO/IEC 23090-3, and MPEG-I Part 3, was finalized in July 2020. In September 2020, the Fraunhofer Versatile Video Encoder (VVenC) and the Fraunhofer Versatile Video Decoder (VVdeC) were released.
Both VVenC and VVdeC are open-source, optimized software. They were originally developed for x86 platforms, and their SIMD code is written with x86 intrinsics, which must be converted before the code can run on an Arm machine. More recent development has ported both tools to Arm platforms using Arm Neon, and SIMDe performs that conversion automatically.
This section describes how to run VVenC on an Arm machine.
On an Arm-based platform with Git and a C++ build toolchain installed, follow these steps:
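Create a working directory:
mkdir simde_test
cd simde_test
Clone the VVenC repository:
git clone https://github.com/fraunhoferhhi/vvenc
cd vvenc
Build VVenC:
make debug
Check the build by printing the version (run this from the directory that contains the built vvencapp binary):
./vvencapp --version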
Full build instructions for both VVenC and VVdeC are available in their GitHub repositories.
As you can see, building VVenC and VVdeC with SIMDe is a simple process that allows x86 code to be ported to Arm-based systems.
Figure 1 shows the performance improvements achieved when using different SIMD architectures. The Enc. Speedup axis shows the performance improvement for the SIMD implementation relative to the unvectorized implementation. Even with Neon code automatically generated from SSE intrinsics using SIMDe, the performance increases by more than 200%.
Figure 1: VVenC speedup on the M1 Max (Neon and emulated SSE4.2) and Core i9 processors (SSE4.2 and AVX2)
Overall, the paper finds that Arm outperforms x86 for the tested systems, but not by a large margin. However, the automatic translation from x86 SIMD intrinsics to Arm Neon using SIMDe is not the most efficient for all types of vector operations performed with these intrinsics. Future optimizations to this process and to the software will further improve the performance of VVenC and VVdeC on Arm platforms.
Look out for more blogs in the future about using SIMDe to port your code.