Thinking in parallel: Branchless conditionals

Laurence Bank
September 20, 2021
8 minute read time.
This is part one in an Arm SIMD tutorial series by guest blogger Laurence Bank.

This article assumes you have a basic understanding of Arm SIMD (also known as Neon) programming. SIMD stands for Single Instruction, Multiple Data, and refers to a set of CPU instructions that process multiple pieces of data in each operation. The purpose of SIMD is to accelerate your algorithms by processing more data per clock cycle than ‘ordinary’ instructions can. Programmers often need to write SIMD code explicitly (with C intrinsics) to take advantage of these capabilities. For more information on Arm Neon programming, please see this excellent tutorial:
Optimizing C Code with Neon Intrinsics

Often, we need to test one or more conditions in our main processing loop, and C compilers have limited ability to vectorize loops that contain conditional statements. Let us consider a simple example: suppose we have a list of integers to sum, but must skip values smaller than a given minimum. In C it might look something like this:

int32_t iSum = 0;

for (int i=0; i<len; i++) {
  if (array[i] >= min) // skip values below the minimum
     iSum += array[i];
}

Some compilers will be able to vectorize this loop, but it is likely that the one you are using cannot. When using Arm Neon intrinsics, it is quite simple to conditionally drop unwanted values from a calculation:

int32x4_t vIn, vSum, vMin;
uint32x4_t vMask;
vSum = vdupq_n_s32(0); // set sum to 0
vMin = vdupq_n_s32(min);

for (int i=0; i<len; i+=4) { // assumes len is a multiple of 4
   vIn = vld1q_s32(&array[i]);
   vMask = vcgeq_s32(vIn, vMin); // all 0s in lanes with rejected values
   vIn = vandq_s32(vIn, vreinterpretq_s32_u32(vMask)); // zero out rejected lanes
   vSum = vaddq_s32(vSum, vIn); // add 4 new lanes to the running sums
}
// Now we have 4 lanes of sums that need to be added horizontally
iSum = vaddvq_s32(vSum); // horizontal sum (AArch64-only intrinsic)

Vector comparison instructions fill each lane with all 1 bits (0xFFFFFFFF) for true or all 0 bits for false. In the code above, we use this result as a mask to separate the values we want from those we do not. Let us plug in some sample values and see what the lanes contain after the comparison:

vIn = vld1q_s32(&array[i]);   // vIn = 85, 100, -2, 22  (min = 10)
vMask = vcgeq_s32(vIn, vMin); // vMask = -1, -1, 0, -1  (-1 = 0xFFFFFFFF)
vIn = vandq_s32(vIn, vreinterpretq_s32_u32(vMask)); // vIn = 85, 100, 0, 22

Notice that the mask allows us to remove the rejected value without affecting the other lanes.
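
One detail worth noting: the loop above processes four lanes per iteration and assumes len is a multiple of 4. A common way to handle arbitrary lengths (a sketch of my own, not from the original code) is to stop the vector loop at the last full group of four and finish the leftovers with scalar code:

int32_t iSum = 0;
int i = 0;
int32x4_t vSum = vdupq_n_s32(0);
int32x4_t vMin = vdupq_n_s32(min);
for (; i <= len - 4; i += 4) { // vector loop over whole groups of 4
   int32x4_t vIn = vld1q_s32(&array[i]);
   uint32x4_t vMask = vcgeq_s32(vIn, vMin); // reject lanes below min
   vSum = vaddq_s32(vSum, vandq_s32(vIn, vreinterpretq_s32_u32(vMask)));
}
iSum = vaddvq_s32(vSum); // horizontal sum of the 4 lane sums
for (; i < len; i++) // scalar tail for the 0-3 leftover elements
   if (array[i] >= min)
      iSum += array[i];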

A more challenging problem to code with SIMD is a list search in which we want the index of a value and not the value itself. For example, consider an unsorted list of integers where we want to know the position of the minimum (or maximum) value. In scalar code it could look like this:

int32_t iMinPos = 0, iMin = 0x7fffffff; // start at the maximum 32-bit value

for (int i=0; i<len; i++) {
  if (array[i] < iMin) {
    iMin = array[i];
    iMinPos = i;
  }
}

This is possible to vectorize with Neon using conditional masking similar to the first example. To keep track of which element holds the min (or max) value, we maintain a second vector of indices that advances as we work through the list.

int32x4_t vIn, vMin, vMinIndices, vIndices, vIncrement;
uint32x4_t vMask;
int32x2_t vMin_2, vMinIndex_2;
uint32x2_t vMask_2;
int32_t iMin, iMinIndex;
const int32_t start_indices[] = {0,1,2,3};
vIndices = vld1q_s32(start_indices);
vIncrement = vdupq_n_s32(4);
vMin = vdupq_n_s32(0x7fffffff); // set to max integer value to start
vMinIndices = vdupq_n_s32(0); // must be initialized before the first select

for (int i=0; i<len; i+=4) { // assumes len is a multiple of 4
   vIn = vld1q_s32(&array[i]);
   vMask = vcltq_s32(vIn, vMin); // which lanes hold new minimums?
   vMin = vminq_s32(vIn, vMin); // keep the minimum values
   vMinIndices = vbslq_s32(vMask, vIndices, vMinIndices); // select min indices
   vIndices = vaddq_s32(vIndices, vIncrement); // update current indices
}
// Now we have 4 minimums and their indices; reduce to the final min + index
vMask_2 = vclt_s32(vget_low_s32(vMin), vget_high_s32(vMin));
vMin_2 = vmin_s32(vget_low_s32(vMin), vget_high_s32(vMin));
vMinIndex_2 = vbsl_s32(vMask_2, vget_low_s32(vMinIndices), vget_high_s32(vMinIndices));
vMask_2 = vclt_s32(vMin_2, vrev64_s32(vMin_2));
vMin_2 = vmin_s32(vMin_2, vrev64_s32(vMin_2));
vMinIndex_2 = vbsl_s32(vMask_2, vMinIndex_2, vrev64_s32(vMinIndex_2));
// Now we have the final min and its index in lane 0
iMin = vget_lane_s32(vMin_2, 0);
iMinIndex = vget_lane_s32(vMinIndex_2, 0);
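
To make the final reduction easier to follow, here is a worked trace with hypothetical lane values (my own example, in the same style as the earlier one). Suppose the loop ends with vMin = 7, 3, 9, 5 and vMinIndices = 40, 13, 2, 11 (lane 0 listed first):

// low half of vMin = {7, 3}, indices {40, 13}; high half = {9, 5}, indices {2, 11}
vMask_2 = vclt_s32(...);     // {7<9, 3<5} = {-1, -1}
vMin_2 = vmin_s32(...);      // {7, 3}
vMinIndex_2 = vbsl_s32(...); // {40, 13}
// vrev64_s32(vMin_2) = {3, 7}, vrev64_s32(vMinIndex_2) = {13, 40}
vMask_2 = vclt_s32(...);     // {7<3, 3<7} = {0, -1}
vMin_2 = vmin_s32(...);      // {3, 3}
vMinIndex_2 = vbsl_s32(...); // {13, 13}
// iMin = 3, iMinIndex = 13, both read from lane 0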


As the examples above show, it is possible to manually vectorize complex loops that the compiler cannot. Besides the obvious benefit of processing several elements in parallel, performance also improves by:

  • Maximizing memory throughput by reading 128 bits at a time
  • Avoiding branches and their associated pipeline penalties

So how does it perform on actual hardware? I created a GitHub repo to share the code so that you can try it yourself (https://github.com/bitbank2/min_search_arm). I tested it on the new Samsung Galaxy S21 mobile phone and the three-year-old Galaxy S10. The Qualcomm SoC inside the S21 happens to have a brand-new Cortex-X1 CPU.

The easiest way to run C code on a Samsung phone may not be what you expect. The normal route would be to create an Android app (written in Java) that links in native code using what Google calls the NDK (Native Development Kit). That requires installing a lot of tools on your PC and involves many steps before anything finally runs on the phone. I chose a simpler path for this code, one that also allows it to run on any Arm Linux machine.

Starting a few generations back, Samsung added a feature to its top-end smartphones called DeX. Its goal was to let the phone serve as a small PC when needed: it added HDMI output through the USB-C connector and changed the Android UI to behave more like a desktop while in this mode. At one point it even supported running Arm Linux as an option, but that feature was recently deprecated. A clever developer brought some of that functionality back in the form of a user-mode-only virtual environment called UserLand, which can run a limited form of Linux that will serve our purpose. Here are the steps needed to run command-line Linux projects on your Samsung phone:

  • Install the UserLand app
  • Optionally connect an HDMI monitor, keyboard, and mouse to your phone (for easier viewing and typing)
  • Launch UserLand
  • Choose Ubuntu as the target environment. If it asks you about SSH vs VNC, choose SSH.
  • Once the environment has downloaded and a terminal window opens, type “sudo apt-get update” followed by “sudo apt-get install make gcc git vim” to install the build tools
  • After the CLI tools have successfully installed, you can clone and run the test app like this:
    git clone https://github.com/bitbank2/min_search_arm
    cd min_search_arm
    make
    ./min_demo

The test program creates a list of 10 million random integers and returns the index (position) of the lowest value. I chose 10 million to make sure the list did not fit in the CPU cache and to make the run long enough for a reasonably accurate timing measurement.
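
For reference, here is a minimal sketch of what such a benchmark driver can look like (my own illustration; the actual code in the repo may differ, and min_index_scalar/min_index_neon are hypothetical names standing in for the two implementations shown above):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define LIST_SIZE 10000000 // large enough to spill out of the CPU caches

int min_index_scalar(const int32_t *list, int len); // scalar version from above
int min_index_neon(const int32_t *list, int len);   // Neon version from above

static int64_t micros(void) { // current time in microseconds
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (int64_t)tv.tv_sec * 1000000 + tv.tv_usec;
}

int main(void) {
    int32_t *list = malloc(LIST_SIZE * sizeof(int32_t));
    for (int i = 0; i < LIST_SIZE; i++)
        list[i] = rand(); // fill the list with random integers

    int64_t t0 = micros();
    int pos = min_index_scalar(list, LIST_SIZE);
    printf("Scalar time = %ldus, min at %d\n", (long)(micros() - t0), pos);

    t0 = micros();
    pos = min_index_neon(list, LIST_SIZE);
    printf("Vector time = %ldus, min at %d\n", (long)(micros() - t0), pos);

    free(list);
    return 0;
}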

Below are the results I got for each phone. “Scalar” refers to the original C code, “Vector” to the Neon code, and “Vector (unrolled)” is a modified version of the Neon code with the loop manually unrolled one extra time. All versions were compiled with “-O3” for maximum optimization effort on the part of the compiler:

Galaxy S21 (Cortex-X1 @ 2.84GHz), GCC 7.5.0

Scalar time = 9804us
Vector time = 2694us
Vector (unrolled) time = 1576us (over 6x faster than scalar)

Galaxy S10 (Qualcomm Kryo 485 @ 2.84GHz), GCC 7.5.0

Scalar time = 9696us
Vector time = 2807us
Vector (unrolled) time = 2766us (the compiler had already unrolled this loop on its own)

Pretty impressive speed for a phone. With the unrolled Neon code, the X1 was able to find the index of the minimum value in a list of 10 million 32-bit integers in about 1.6ms! A few things can be inferred from the results. The key one is that the X1 in the S21 is clocked at the same rate as the ‘performance’ core in the S10, so the slightly faster scalar result on the S10 indicates that the compiler unrolled loops more aggressively there. This is backed up by the simple and unrolled versions of the Neon code executing in essentially the same time on the S10. A good comparison of the actual speed difference between the two generations of processors is the unrolled execution time. The X1 and the Kryo 485 run at the same base clock (2.84GHz), though the X1 can ‘boost’ its clock up to 3.0GHz. Still, the X1 ran the code nearly twice as fast due to its pipeline improvements.

To conclude, a few takeaways:

  • Manually writing Neon intrinsics for relatively complex loops can measurably improve execution speed.
  • With vector compares and logical operations, many complex conditional statements can be turned into branchless SIMD code.
  • Different versions of Linux and the Android NDK can ship compilers that are several generations old. The version of GCC installed on UserLand’s Ubuntu was from 2017. If possible, install the latest compiler to take advantage of recent improvements.
  • Neon intrinsics will be rearranged by the compiler for optimal instruction scheduling and loop unrolling. If you want the processor to execute a fixed set of instructions, you will need to write assembly language in a separate .S file (see the sketch after this list).
  • If possible, check the compiler’s output to verify that it comes out as expected. Compiler Explorer is a good tool to see this in action.
  • Even the latest compilers still have difficulty optimizing loops containing conditional statements. This is an active area of compiler research, but it is worth knowing how to optimize such code manually with Neon SIMD until compilers improve.
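
For illustration, here is a minimal sketch of what the masked-sum loop from the first example could look like as a hand-written AArch64 .S file (my own hypothetical code, not from the repo; it assumes len is a positive multiple of 4):

// masked_sum.S: int32_t masked_sum(const int32_t *array, int len, int32_t min);
// Arguments per the AAPCS64 calling convention: x0 = array, w1 = len, w2 = min
        .text
        .global masked_sum
masked_sum:
        dup     v2.4s, w2               // vMin = {min, min, min, min}
        movi    v3.4s, #0               // vSum = 0
1:      ld1     {v0.4s}, [x0], #16      // load 4 ints, post-increment the pointer
        cmge    v1.4s, v0.4s, v2.4s     // mask = (vIn >= vMin) ? all 1s : 0
        and     v0.16b, v0.16b, v1.16b  // zero the rejected lanes
        add     v3.4s, v3.4s, v0.4s     // accumulate 4 lane sums
        subs    w1, w1, #4              // 4 fewer elements to process
        b.gt    1b
        addv    s0, v3.4s               // horizontal sum of the 4 lanes
        fmov    w0, s0                  // return the sum in w0
        ret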