Part 1: Porting to Arm Intrinsics with SIMDe

June 26, 2024

This blog presents a case study using SIMD Everywhere (SIMDe) to automatically port software using x86 SSE and AVX SIMD intrinsics to Arm Neon.

A special thanks

This case study was originally described in the academic paper Open and optimized VVC Implementations on ARM Architectures by Benjamin Bross, Christian Lehmann, Gabriel Hege, and Adam Wieckowski from the Video Communication and Applications Department of Fraunhofer HHI in Berlin (2023). We would like to thank them for their insights and open-source contributions to the developer community.

The software ported in the study is the Fraunhofer Versatile Video Encoder (VVenC) and Versatile Video Decoder (VVdeC). The paper describes how SIMDe was used to convert Intel x86 intrinsics to Arm Neon instructions and analyzes performance of the resulting Arm port compared to x86 machines.

This blog summarizes the findings of the paper and provides additional details to help you replicate the results.

Intrinsics

There are many ways to write code, each with its own advantages and disadvantages. In the end, though, all code runs as machine code, using the instructions that the target device's ISA provides.

Writing high-level code such as C or C++ is easiest for the programmer, leaving the compiler to turn your program into machine code instructions. However, the programmer relinquishes control over the precise instructions that are chosen by the compiler.

Writing low-level code such as assembly is much more complicated for the programmer but provides a high level of control over the instructions used. Writing directly in assembly can also result in significant portability and engineering complexity costs.

Intrinsics are function calls that the compiler replaces with specific assembly instructions. Intrinsics provide a way to write code that is more easily maintained than assembler, while keeping control of which instructions are generated. This gives you direct, low-level access to the exact instructions you want, all from high-level C or C++ code.

However, because intrinsics are tightly bound to the underlying ISA, they are often not portable across architectures. For example, Intel x86 and Arm provide different SIMD intrinsics, which reflect the different Instruction Set Architectures (ISAs).
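To make the portability problem concrete, the following sketch adds four 32-bit integers using SSE2 intrinsics on x86, Neon intrinsics on Arm, and plain C everywhere else. The function name add4_i32 is hypothetical and chosen only for illustration. The intrinsics come from the standard <emmintrin.h> and <arm_neon.h> headers. Notice that the same operation needs different types, different function names, and a different header on each architecture.

```c
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* x86 SSE2 intrinsics */

/* x86 path: one 128-bit add processes all four 32-bit lanes. */
void add4_i32(const int32_t *a, const int32_t *b, int32_t *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}

#elif defined(__ARM_NEON)
#include <arm_neon.h>   /* Arm Neon intrinsics */

/* Arm path: same operation, different vector type and intrinsic names. */
void add4_i32(const int32_t *a, const int32_t *b, int32_t *out)
{
    int32x4_t va = vld1q_s32(a);
    int32x4_t vb = vld1q_s32(b);
    vst1q_s32(out, vaddq_s32(va, vb));
}

#else

/* Portable fallback: scalar loop, one element at a time. */
void add4_i32(const int32_t *a, const int32_t *b, int32_t *out)
{
    for (int i = 0; i < 4; ++i) {
        out[i] = a[i] + b[i];
    }
}

#endif
```

Maintaining a hand-written branch like this for every intrinsic in a large codebase is the work that an automatic porting layer aims to avoid.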

The example in this case study makes extensive use of Intel x86 intrinsics, which must be ported before the code can run on an Arm-based device.

SIMD instructions on Arm and Intel

Single Instruction, Multiple Data (SIMD) architecture enables parallel processing of data. A single SIMD instruction operates on multiple data values simultaneously. This data is often arranged in arrays or vectors. The benefit of SIMD is that tasks processing large amounts of data can execute faster than they would using scalar instructions.

SIMD code is often written using intrinsics. Intel x86 and Arm provide different intrinsics, which reflect the different Instruction Set Architectures (ISAs):

Intel SIMD are SIMD instructions and intrinsics that work specifically on Intel x86 chips. Intel provides different sets of SIMD intrinsics, including Streaming SIMD Extensions (SSE), MMX, and Advanced Vector eXtensions (AVX), as well as newer versions such as SSE2, SSE3 and AVX-512.
Arm Neon is an architecture extension for the Arm architecture family. Arm Neon is similar to Intel SIMD in that it uses SIMD intrinsics to process data faster. However, code which uses Neon instructions can only run on Arm-based systems.

Each SIMD instruction set only executes on the architecture that it was designed for. This means that Intel SIMD instructions do not work on Arm-based systems and vice versa. This causes a problem when porting code from one platform to another. Porting code from Intel SIMD to Arm Neon can be time-consuming and labor-intensive:

Porting code often requires manually translating each intrinsic.
You need time to familiarize yourself with a new instruction set.

SIMD Everywhere

SIMD Everywhere (SIMDe) is an open-source library that provides portable implementations of SIMD intrinsics on hardware that does not natively support them. For example, SIMDe lets you use Intel SSE functions on Arm-based hardware. SIMDe is header-only, so it does not need to be compiled separately. All you need to do is point the compiler to the location of the headers and add a #include to your source code. SIMDe then translates the original x86 SIMD intrinsics to Neon when the code is compiled, which ports the code for you.
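As a minimal sketch of this workflow, the code below calls the SSE2 saturating add _mm_adds_epi16 through SIMDe. It assumes that the SIMDe headers are on the include path, so that simde/x86/sse2.h can be found, and it defines SIMDE_ENABLE_NATIVE_ALIASES so that the original _mm_* names can be used unchanged. The function name adds8_i16 and the HAVE_SSE2_API macro are hypothetical. The __has_include guard and the scalar fallback are not part of normal SIMDe usage. They are included only so that the sketch still compiles on a machine where SIMDe is not installed.

```c
#include <stdint.h>

/* Prefer SIMDe when its headers can be found on the include path. */
#ifdef __has_include
#  if __has_include("simde/x86/sse2.h")
#    define SIMDE_ENABLE_NATIVE_ALIASES  /* expose the _mm_* names on non-x86 targets */
#    include "simde/x86/sse2.h"
#    define HAVE_SSE2_API 1
#  endif
#endif

/* Without SIMDe, use the native header when building for x86 with SSE2. */
#if !defined(HAVE_SSE2_API) && defined(__SSE2__)
#  include <emmintrin.h>
#  define HAVE_SSE2_API 1
#endif

/* Saturating add of eight signed 16-bit values. */
void adds8_i16(const int16_t *a, const int16_t *b, int16_t *out)
{
#if defined(HAVE_SSE2_API)
    /* Unmodified x86 intrinsic code. On Arm, SIMDe supplies these names. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_adds_epi16(va, vb));
#else
    /* Scalar fallback, used only when neither SIMDe nor SSE2 is available. */
    for (int i = 0; i < 8; ++i) {
        int32_t sum = (int32_t)a[i] + (int32_t)b[i];
        if (sum > INT16_MAX) sum = INT16_MAX;
        if (sum < INT16_MIN) sum = INT16_MIN;
        out[i] = (int16_t)sum;
    }
#endif
}
```

To build this against SIMDe, pass the directory that contains the simde folder to the compiler with the -I option.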

The advantages of using SIMDe include:

Eliminates the need to manually rewrite code for each individual architecture.
Does not limit functionality to the lowest common denominator. Instead, SIMDe minimizes the effort needed to port instructions, which leaves you room to optimize the ported code where needed.
You can run a non-native SIMD instruction set on your machine without having to run an emulator. This is a much easier path for code development.
Easy to run and install.

Usually, the SIMDe library offers a much higher ratio of performance to effort and time than manual porting, while still allowing optimization when needed.

However, with any automatic library there can be some drawbacks:

Automatic conversion results are impressive. However, manually converting the instruction sets can deliver better performance than automatic conversion.
The project is in active development and constantly changing. This means that there can be occasional new bugs. This also means that there is constant verification on the project and occasional new features.
There can be some issues when there is no native support. This means that more functions need headers such as <h>. The library is still functional, but the results of those functions are undefined.
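The first drawback above can be illustrated with a horizontal sum, which adds the four lanes of a vector into a single value. This is an illustrative case and is not taken from the paper or from SIMDe itself. The function name hsum4_i32 is hypothetical. In this sketch, the SSE2 path needs two shuffles and two adds, whereas a hand-written AArch64 path can use the single Neon reduction intrinsic vaddvq_s32. A line-by-line translation of the x86 sequence would not necessarily find that shortcut.

```c
#include <stdint.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>

/* Hand-written AArch64 path: one across-vector add does the reduction. */
int32_t hsum4_i32(const int32_t *p)
{
    return vaddvq_s32(vld1q_s32(p));
}

#elif defined(__SSE2__)
#include <emmintrin.h>

/* SSE2 path: SSE2 has no single reduction intrinsic, so shuffle and add twice. */
int32_t hsum4_i32(const int32_t *p)
{
    __m128i v = _mm_loadu_si128((const __m128i *)p);
    /* Swap the upper and lower 64-bit halves, then add: lanes hold v0+v2, v1+v3, ... */
    v = _mm_add_epi32(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2)));
    /* Swap adjacent 32-bit lanes, then add: lane 0 holds the full sum. */
    v = _mm_add_epi32(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(v);
}

#else

/* Portable fallback for any other target. */
int32_t hsum4_i32(const int32_t *p)
{
    return p[0] + p[1] + p[2] + p[3];
}

#endif
```

A reasonable approach is to port everything automatically first, profile the result, and then hand-optimize only the hot spots in this way.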

For more information see the SIMDe github page.

Porting VVenC and VVdeC using SIMDe

The international video coding standard Versatile Video Coding (VVC), also known as H.266, ISO/IEC 23090-3, and MPEG-I Part 3, was finalized in July 2020. In September 2020, the encoder, Fraunhofer Versatile Video Encoder (VVenC), and the decoder, Fraunhofer Versatile Video Decoder (VVdeC), were released.

Both VVenC and VVdeC are open and optimized pieces of software. Originally, this software was developed for x86 architecture platforms. However, more recent developments have ported these tools to Arm platforms using Arm Neon. SIMDe enables this porting. Because the code uses x86 intrinsics, those intrinsics must be converted before the code can run on an Arm machine. SIMDe performs this conversion automatically.

This section describes how to run VVenC on an Arm machine.

For an Arm-based platform you need the following software:

CMake 3.13 or later
gcc-5.0 or later

Create a simde_test directory in the desired location using:

mkdir simde_test
cd simde_test

Clone the project into the created directory using:

git clone https://github.com/fraunhoferhhi/vvenc
cd vvenc

Build the project using make. The Makefile provides several build targets, but in this case use the debug target:

make debug

Test the build by running the executable with the --version option:

./vvencapp --version

Full instructions for both VVenC and VVdeC are available here:

As you can see, building VVenC and VVdeC with SIMDe is a simple process that allows x86 code to be ported to Arm-based systems.

Results

Figure 1 shows the performance improvements achieved when using different SIMD architectures. The Enc. Speedup axis shows the performance improvement for the SIMD implementation relative to the unvectorized implementation. Even with Neon code automatically generated from SSE intrinsics using SIMDe, the performance increases by more than 200%.
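If you want to compare your own builds in the same way, a speedup figure can be computed from two encoding times. The sketch below is an assumption about the convention rather than the paper's stated method: it treats speedup as the scalar time divided by the SIMD time, and the percentage increase as the speedup minus one, times 100. Under that convention, an increase of 200% corresponds to a speedup factor of 3. The function names and the timings in the comment are hypothetical.

```c
/* Speedup factor: how many times faster the SIMD build runs than the scalar build.
   Both arguments are wall-clock times in the same unit, for the same input. */
double speedup_factor(double t_scalar, double t_simd)
{
    return t_scalar / t_simd;
}

/* Percentage increase in performance relative to the scalar build.
   Assumed convention: a factor of 1.0 means 0% (no change). */
double percent_increase(double t_scalar, double t_simd)
{
    return (speedup_factor(t_scalar, t_simd) - 1.0) * 100.0;
}

/* Example with hypothetical timings: a scalar build taking 30 seconds and a
   SIMD build taking 10 seconds give a factor of 3.0 and an increase of 200%. */
```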

Graph showing the comparison of SIMD vs Scalar

Figure 1: VVenC speedup on the M1 Max (NEON and SSE4.2 emulated) and Core i9 processors (SSE4.2 and AVX2)

Overall, the paper finds that Arm outperforms x86 for the tested systems but not by a large margin. However, the automatic translation from x86 SIMD intrinsics to Arm Neon using SIMDe is not the most efficient for all types of vector operations that are performed using these intrinsics. Future optimizations on this process and software will further improve the performance of VVenC and VVdeC on Arm platforms.

Look out for more blogs in the future about using SIMDe to port your code.

Useful Resources

Repositories

Fraunhofer HHI research team
