This article assumes you have a basic understanding of Arm SIMD (aka Neon) programming. SIMD is an acronym for Single Instruction Multiple Data, and it refers to a set of CPU instructions that can process multiple pieces of data in each operation. The purpose of SIMD is to accelerate your algorithms by processing more data per clock cycle than ‘ordinary’ instructions can. It is often necessary for programmers to explicitly write SIMD code (C intrinsics) to take advantage of these added capabilities. For more info on Arm Neon programming, please see this excellent tutorial: Optimizing C Code with Neon Intrinsics
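If you need a quick refresher, here is a minimal example (my own illustration, not from the tutorial above) of the basic idea: a single Neon intrinsic operates on four 32-bit integers at once.

#include <arm_neon.h>
#include <stdio.h>

int main(void)
{
    int32_t a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];
    int32x4_t va = vld1q_s32(a);     // load 4 ints into one 128-bit register
    int32x4_t vb = vld1q_s32(b);
    vst1q_s32(c, vaddq_s32(va, vb)); // one instruction adds all 4 lanes
    printf("%d %d %d %d\n", c[0], c[1], c[2], c[3]); // prints: 11 22 33 44
    return 0;
}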
Often, we need to test one or more conditions in our main processing loop. C compilers have limited ability to vectorize loops with conditional statements. Let us consider a simple example: Suppose that we have a list of integers that we need to sum, but we need to skip values smaller than a given minimum. In C it might look something like this:
int32_t iSum = 0;
for (int i=0; i<len; i++) {
    if (array[i] >= min)
        iSum += array[i];
}
Some compilers will be able to vectorize this loop, but it is likely that the one you are using cannot. When using Arm Neon intrinsics, it is quite simple to conditionally drop unwanted values from a calculation:
#include <arm_neon.h> // Neon intrinsics header

int32x4_t vIn, vSum, vMin;
uint32x4_t vMask; // comparison results are unsigned vectors
vSum = vdupq_n_s32(0); // set sum to 0
vMin = vdupq_n_s32(min);
for (int i=0; i<len; i+=4) { // assumes len is a multiple of 4
    vIn = vld1q_s32(&array[i]);
    vMask = vcgeq_s32(vIn, vMin); // 0’s in lanes with rejected values
    vIn = vandq_s32(vIn, vreinterpretq_s32_u32(vMask)); // mask off values below min
    vSum = vaddq_s32(vSum, vIn); // add 4 new lanes to the running sums
}
// Now we have 4 lanes of sums that need to be added horizontally
iSum = vaddvq_s32(vSum); // horizontal sum
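One portability note worth adding (my observation, not from the original): vaddvq_s32 is an AArch64-only intrinsic. If you must also build for 32-bit Arm, the horizontal sum can be done with pairwise adds instead:

// horizontal sum of 4 lanes without vaddvq_s32 (also works on 32-bit Arm/Neon)
int32x2_t vSum_2 = vadd_s32(vget_low_s32(vSum), vget_high_s32(vSum)); // 2 partial sums
vSum_2 = vpadd_s32(vSum_2, vSum_2); // pairwise add leaves the total in both lanes
iSum = vget_lane_s32(vSum_2, 0);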
Vector comparison instructions set every bit of a lane to 1 for true (0xFFFFFFFF) or to 0 for false. In the code above, we use this result as a mask to separate the parts we want from those we do not. Let us plug in some sample values and see what the lanes will contain after this comparison:
vIn = vld1q_s32(&array[i]);                         // vIn   = 85, 100, -2, 22 (min = 10)
vMask = vcgeq_s32(vIn, vMin);                       // vMask = -1, -1, 0, -1   (-1 = 0xFFFFFFFF)
vIn = vandq_s32(vIn, vreinterpretq_s32_u32(vMask)); // vIn   = 85, 100, 0, 22
Notice that the mask allows us to remove the rejected value without affecting the other lanes.
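As an aside (my own extension of the idea, not from the original example), the same mask can do more than zero out values. Because a ‘true’ lane is all ones, which is -1 as a signed integer, subtracting the mask from a counter tallies how many values passed the test:

#include <arm_neon.h>

// hypothetical helper: counts how many elements are >= min by reusing the mask
int count_accepted(const int32_t *array, int len, int32_t min)
{
    int32x4_t vCount = vdupq_n_s32(0);
    int32x4_t vMin = vdupq_n_s32(min);
    for (int i = 0; i < len; i += 4) { // assumes len is a multiple of 4
        int32x4_t vIn = vld1q_s32(&array[i]);
        uint32x4_t vMask = vcgeq_s32(vIn, vMin); // all ones (-1) in accepted lanes
        // subtracting -1 adds 1 for each accepted lane
        vCount = vsubq_s32(vCount, vreinterpretq_s32_u32(vMask));
    }
    return vaddvq_s32(vCount); // horizontal sum of the 4 lane counts
}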
A more challenging problem to code with SIMD is a list search in which we want the index of a value rather than the value itself. For example, consider an unsorted list of integers in which we want to know the position of the minimum (or maximum) value. In scalar code it could look like this:
int32_t iMinPos = 0, iMin = 0x7fffffff;
for (int i=0; i<len; i++) {
    if (array[i] < iMin) {
        iMin = array[i];
        iMinPos = i;
    }
}
This is possible to vectorize with Neon using conditional masking similar to the first example. To keep track of which element number is holding the min (or max) value, we create a new vector of indices to follow as we work through the list.
int32x4_t vIn, vMin, vMinIndices, vIndices, vIncrement;
uint32x4_t vMask; // comparison results are unsigned vectors
int32x2_t vMin_2, vMinIndex_2;
uint32x2_t vMask_2;
int iMin, iMinIndex;
const int32_t start_indices[] = {0,1,2,3};
vIndices = vld1q_s32(start_indices);
vIncrement = vdupq_n_s32(4);
vMin = vdupq_n_s32(0x7fffffff); // set to max integer value to start
vMinIndices = vdupq_n_s32(0); // must be initialized before the first vbslq
for (int i=0; i<len; i+=4) { // assumes len is a multiple of 4
    vIn = vld1q_s32(&array[i]);
    vMask = vcltq_s32(vIn, vMin); // which lanes are less?
    vMin = vminq_s32(vIn, vMin); // keep the minimum values
    vMinIndices = vbslq_s32(vMask, vIndices, vMinIndices); // select min indices
    vIndices = vaddq_s32(vIndices, vIncrement); // update current indices
}
// Now we have 4 minimums and indices; find the final min value + index
vMask_2 = vclt_s32(vget_low_s32(vMin), vget_high_s32(vMin));
vMin_2 = vmin_s32(vget_low_s32(vMin), vget_high_s32(vMin));
vMinIndex_2 = vbsl_s32(vMask_2, vget_low_s32(vMinIndices), vget_high_s32(vMinIndices));
vMask_2 = vclt_s32(vMin_2, vrev64_s32(vMin_2));
vMin_2 = vmin_s32(vMin_2, vrev64_s32(vMin_2));
vMinIndex_2 = vbsl_s32(vMask_2, vMinIndex_2, vrev64_s32(vMinIndex_2));
// Now we have the final min and index
iMin = vget_lane_s32(vMin_2, 0);
iMinIndex = vget_lane_s32(vMinIndex_2, 0);
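As with the sum example, this loop assumes len is a multiple of 4. One way to handle the remainder (my addition, not part of the original code) is a short scalar tail that continues from the vector result:

// finish any leftover elements (when len is not a multiple of 4) in scalar code,
// starting from the vector result already in iMin/iMinIndex
for (int i = len & ~3; i < len; i++) {
    if (array[i] < iMin) {
        iMin = array[i];
        iMinIndex = i;
    }
}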
As shown in the examples above, it is possible to manually vectorize complex loops that the compiler is unable to. Besides the obvious benefit of processing four elements per operation, performance is also improved by executing far fewer loop iterations (and therefore far fewer compare-and-branch instructions), by loading data from memory in wider chunks, and by replacing the data-dependent if statement with branch-free masking that cannot suffer branch mispredictions.
So how does it perform on actual hardware? I created a GitHub repo to share the code so that you can try it yourself (https://github.com/bitbank2/min_search_arm). I tested it on the new Samsung Galaxy S21 mobile phone and the three-year-old Galaxy S10. The Qualcomm SoC inside the S21 happens to have a brand-new Cortex-X1 CPU.
The easiest way to run C code on a Samsung phone may not be what you expect. The normal route would be to create an Android app (written in Java) that links in native code (using what Google calls the NDK, the Native Development Kit). This requires installing a lot of tools on your PC and involves many steps before anything finally runs on the phone. I chose a simpler path for this code, one that also allows it to run on any Arm Linux machine.
Starting a few generations back, Samsung released a new feature for their top-end smartphones called DeX. Its goal was to let the phone be used as a small PC when needed. It accomplished this by adding HDMI output through the USB-C connector and changing the Android UI to behave more like a desktop PC while in this mode. At one point, it even supported running Arm Linux as an option, but that feature was recently deprecated. A clever developer brought some of that functionality back in the form of a user-mode-only virtual environment called UserLand. It is able to run a limited form of Linux that will serve our purpose. The steps to run command-line Linux projects on your Samsung phone are roughly these:

1. Install the UserLand app from the Google Play Store.
2. Create a new Linux session (e.g. Ubuntu) and open its terminal.
3. Install the build tools (e.g. apt install build-essential git).
4. Clone the repo (git clone https://github.com/bitbank2/min_search_arm), build it, and run it.
The test program creates a list of 10 million random integers and returns the index (position) of the lowest value. I chose 10 million to make sure the list did not fit in the CPU cache and so that the program executes long enough to get a reasonably accurate measurement of the execution time. Below are the results I got for each phone. “Scalar” refers to the original C code, “Vector” to the Neon code, and “Vector (unrolled)” to a modified version of the Neon code that manually unrolls the loop once (processing 8 elements per iteration). All versions were compiled with “-O3” for maximum optimization effort on the part of the compiler:
Galaxy S21 (Cortex-X1):
Scalar time = 9804us
Vector time = 2694us
Vector (unrolled) time = 1576us (wow, 6x faster than scalar)

Galaxy S10:
Scalar time = 9696us
Vector time = 2807us
Vector (unrolled) time = 2766us (the compiler had already unrolled the loop above)
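Before digging into these numbers: timings like the ones above can be captured with a simple wall-clock helper; a sketch of the kind of harness involved follows (hypothetical function names; see the repo for the actual code):

#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>

// wall-clock time in microseconds
static int64_t micros(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (int64_t)tv.tv_sec * 1000000 + tv.tv_usec;
}

// usage: time one run of the search (min_index_neon is a hypothetical name)
int64_t t0 = micros();
int iPos = min_index_neon(array, LEN);
printf("Vector time = %ldus (index %d)\n", (long)(micros() - t0), iPos);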
Pretty impressive speed for a phone. With the unrolled Neon code, the X1 was able to find the index of the minimum value in a list of 10 million 32-bit integers in roughly 1.5ms! A few things can be inferred from the results. The key thing is that the X1 in the S21 is clocked at the same rate as the ‘performance’ core in the S10, so the slightly faster scalar result on the S10 indicates that its compiler is more aggressive about unrolling loops. This is backed up by the fact that the simple and unrolled versions of the Neon code execute in basically the same amount of time on the S10. A good comparison of the actual speed difference between the two generations of processors is the unrolled execution time. The X1 and the Kryo 485 are clocked at the same base speed (2.84GHz), though the X1 can boost its clock up to 3.0GHz. Still, the X1 ran the code nearly twice as fast due to its pipeline improvements.
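To illustrate what unrolling the loop once means here, below is a sketch of one way to do it (my reconstruction of the idea, not necessarily the exact code in the repo). Using two independent min/index chains lets the CPU overlap the two halves of each iteration instead of serializing on a single vMin:

// process 8 elements per iteration with two independent accumulator chains
const int32_t s0[] = {0,1,2,3}, s1[] = {4,5,6,7};
int32x4_t vMin0 = vdupq_n_s32(0x7fffffff), vMin1 = vMin0;
int32x4_t vIdx0 = vld1q_s32(s0), vIdx1 = vld1q_s32(s1);
int32x4_t vMinIdx0 = vdupq_n_s32(0), vMinIdx1 = vdupq_n_s32(0);
int32x4_t vInc = vdupq_n_s32(8); // each chain advances by 8 per iteration
for (int i = 0; i < len; i += 8) { // assumes len is a multiple of 8
    int32x4_t vIn0 = vld1q_s32(&array[i]);
    int32x4_t vIn1 = vld1q_s32(&array[i + 4]);
    uint32x4_t vM0 = vcltq_s32(vIn0, vMin0);
    uint32x4_t vM1 = vcltq_s32(vIn1, vMin1);
    vMin0 = vminq_s32(vIn0, vMin0);
    vMin1 = vminq_s32(vIn1, vMin1);
    vMinIdx0 = vbslq_s32(vM0, vIdx0, vMinIdx0);
    vMinIdx1 = vbslq_s32(vM1, vIdx1, vMinIdx1);
    vIdx0 = vaddq_s32(vIdx0, vInc);
    vIdx1 = vaddq_s32(vIdx1, vInc);
}
// merge the two chains, then reduce to a single min + index as before
uint32x4_t vM = vcltq_s32(vMin1, vMin0);
int32x4_t vMinIndices = vbslq_s32(vM, vMinIdx1, vMinIdx0);
int32x4_t vMin = vminq_s32(vMin0, vMin1);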
To conclude, a few takeaways:

- Compilers have limited ability to vectorize loops containing conditional statements; do not assume yours did it for you.
- Neon comparison instructions produce lane masks that let you replace branches with branch-free masking and selection (vand/vbsl), making conditional loops straightforward to vectorize by hand.
- Keeping a vector of running indices makes position searches (min/max index) vectorizable too.
- Manually unrolling the Neon loop can deliver a further large speedup on a wide core like the Cortex-X1, though some compilers will already have done it for you.
- Measure on real hardware: compilers and CPU generations differ more than you might expect.