This article assumes you have a basic understanding of Arm SIMD (aka Neon) programming. SIMD is an acronym for Single Instruction Multiple Data, and it refers to a set of CPU instructions that can process multiple pieces of data in each operation. The purpose of SIMD is to accelerate your algorithms by processing more data per clock cycle than ‘ordinary’ instructions can. It is often necessary for programmers to explicitly write SIMD code (C intrinsics) to take advantage of these added capabilities. For more info on Arm Neon programming, please see this excellent tutorial: Optimizing C Code with Neon Intrinsics
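If you need a quick refresher, here is a minimal example (my own illustration, not from the tutorial above) of the basic idea: a single Neon intrinsic operates on four 32-bit integers at once.

#include <arm_neon.h>
#include <stdio.h>

int main(void)
{
    int32_t a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];
    int32x4_t va = vld1q_s32(a);     // load 4 ints into one 128-bit register
    int32x4_t vb = vld1q_s32(b);
    vst1q_s32(c, vaddq_s32(va, vb)); // one instruction adds all 4 lanes
    printf("%d %d %d %d\n", c[0], c[1], c[2], c[3]); // prints: 11 22 33 44
    return 0;
}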
Often, we need to test one or more conditions in our main processing loop. C compilers have limited ability to vectorize loops with conditional statements. Let us consider a simple example: Suppose that we have a list of integers that we need to sum, but we need to skip values smaller than a given minimum. In C it might look something like this:
int32_t iSum = 0;
for (int i=0; i<len; i++) {
    if (array[i] >= min)
        iSum += array[i];
}
Some compilers will be able to vectorize this loop, but it is likely that the one you are using cannot. When using Arm Neon intrinsics, it is quite simple to conditionally drop unwanted values from a calculation:
#include <arm_neon.h> // Neon intrinsics header

int32x4_t vIn, vSum, vMin;
uint32x4_t vMask; // comparison results are unsigned vectors
vSum = vdupq_n_s32(0); // set sum to 0
vMin = vdupq_n_s32(min);
for (int i=0; i<len; i+=4) { // assumes len is a multiple of 4
    vIn = vld1q_s32(&array[i]);
    vMask = vcgeq_s32(vIn, vMin); // 0’s in lanes with rejected values
    vIn = vandq_s32(vIn, vreinterpretq_s32_u32(vMask)); // mask off values below min
    vSum = vaddq_s32(vSum, vIn); // add 4 new lanes to the running sums
}
// Now we have 4 lanes of sums that need to be added horizontally
iSum = vaddvq_s32(vSum); // horizontal sum
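One portability note worth adding (my observation, not from the original): vaddvq_s32 is an AArch64-only intrinsic. If you must also build for 32-bit Arm, the horizontal sum can be done with pairwise adds instead:

// horizontal sum of 4 lanes without vaddvq_s32 (also works on 32-bit Arm/Neon)
int32x2_t vSum_2 = vadd_s32(vget_low_s32(vSum), vget_high_s32(vSum)); // 2 partial sums
vSum_2 = vpadd_s32(vSum_2, vSum_2); // pairwise add leaves the total in both lanes
iSum = vget_lane_s32(vSum_2, 0);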
Vector comparison instructions set every bit of a lane to 1 for true (0xFFFFFFFF) or to 0 for false. In the code above, we use this result as a mask to separate the parts we want from those we do not. Let us plug in some sample values and see what the lanes will contain after this comparison:
vIn = vld1q_s32(&array[i]);                         // vIn   = 85, 100, -2, 22 (min = 10)
vMask = vcgeq_s32(vIn, vMin);                       // vMask = -1, -1, 0, -1   (-1 = 0xFFFFFFFF)
vIn = vandq_s32(vIn, vreinterpretq_s32_u32(vMask)); // vIn   = 85, 100, 0, 22
Notice that the mask allows us to remove the rejected value without affecting the other lanes.
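As an aside (my own extension of the idea, not from the original example), the same mask can do more than zero out values. Because a ‘true’ lane is all ones, which is -1 as a signed integer, subtracting the mask from a counter tallies how many values passed the test:

#include <arm_neon.h>

// hypothetical helper: counts how many elements are >= min by reusing the mask
int count_accepted(const int32_t *array, int len, int32_t min)
{
    int32x4_t vCount = vdupq_n_s32(0);
    int32x4_t vMin = vdupq_n_s32(min);
    for (int i = 0; i < len; i += 4) { // assumes len is a multiple of 4
        int32x4_t vIn = vld1q_s32(&array[i]);
        uint32x4_t vMask = vcgeq_s32(vIn, vMin); // all ones (-1) in accepted lanes
        // subtracting -1 adds 1 for each accepted lane
        vCount = vsubq_s32(vCount, vreinterpretq_s32_u32(vMask));
    }
    return vaddvq_s32(vCount); // horizontal sum of the 4 lane counts
}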
A more challenging problem to code with SIMD is a list search in which we want the index of a value rather than the value itself. For example, consider an unsorted list of integers in which we want to know the position of the minimum (or maximum) value. In scalar code it could look like this:
int32_t iMinPos = 0, iMin = 0x7fffffff;
for (int i=0; i<len; i++) {
    if (array[i] < iMin) {
        iMin = array[i];
        iMinPos = i;
    }
}
This is possible to vectorize with Neon using conditional masking similar to the first example. To keep track of which element number is holding the min (or max) value, we create a new vector of indices to follow as we work through the list.
int32x4_t vIn, vMin, vMinIndices, vIndices, vIncrement;
uint32x4_t vMask; // comparison results are unsigned vectors
int32x2_t vMin_2, vMinIndex_2;
uint32x2_t vMask_2;
int iMin, iMinIndex;
const int32_t start_indices[] = {0,1,2,3};
vIndices = vld1q_s32(start_indices);
vIncrement = vdupq_n_s32(4);
vMin = vdupq_n_s32(0x7fffffff); // set to max integer value to start
vMinIndices = vdupq_n_s32(0); // must be initialized before the first vbslq
for (int i=0; i<len; i+=4) { // assumes len is a multiple of 4
    vIn = vld1q_s32(&array[i]);
    vMask = vcltq_s32(vIn, vMin); // which lanes are less?
    vMin = vminq_s32(vIn, vMin); // keep the minimum values
    vMinIndices = vbslq_s32(vMask, vIndices, vMinIndices); // select min indices
    vIndices = vaddq_s32(vIndices, vIncrement); // update current indices
}
// Now we have 4 minimums and indices; find the final min value + index
vMask_2 = vclt_s32(vget_low_s32(vMin), vget_high_s32(vMin));
vMin_2 = vmin_s32(vget_low_s32(vMin), vget_high_s32(vMin));
vMinIndex_2 = vbsl_s32(vMask_2, vget_low_s32(vMinIndices), vget_high_s32(vMinIndices));
vMask_2 = vclt_s32(vMin_2, vrev64_s32(vMin_2));
vMin_2 = vmin_s32(vMin_2, vrev64_s32(vMin_2));
vMinIndex_2 = vbsl_s32(vMask_2, vMinIndex_2, vrev64_s32(vMinIndex_2));
// Now we have the final min and index
iMin = vget_lane_s32(vMin_2, 0);
iMinIndex = vget_lane_s32(vMinIndex_2, 0);
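As with the sum example, this loop assumes len is a multiple of 4. One way to handle the remainder (my addition, not part of the original code) is a short scalar tail that continues from the vector result:

// finish any leftover elements (when len is not a multiple of 4) in scalar code,
// starting from the vector result already in iMin/iMinIndex
for (int i = len & ~3; i < len; i++) {
    if (array[i] < iMin) {
        iMin = array[i];
        iMinIndex = i;
    }
}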
As shown in the examples above, it is possible to manually vectorize complex loops that the compiler is unable to. Besides the obvious benefit of processing four elements per operation, performance is also improved by executing far fewer loop iterations (and therefore far fewer compare-and-branch instructions), by loading data from memory in wider chunks, and by replacing the data-dependent if statement with branch-free masking that cannot suffer branch mispredictions.
So how does it perform on actual hardware? I created a GitHub repo to share the code so that you can try it yourself (https://github.com/bitbank2/min_search_arm). I tested it on the new Samsung Galaxy S21 mobile phone and the three-year-old Galaxy S10. The Qualcomm SoC inside the S21 happens to have a brand-new Cortex-X1 CPU.
The easiest way to run C code on a Samsung phone may not be what you expect. The normal route would be to create an Android app (written in Java) that links in native code (using what Google calls the NDK, the Native Development Kit). This requires installing a lot of tools on your PC and involves many steps before anything finally runs on the phone. I chose a simpler path for this code, one that also allows it to run on any Arm Linux machine.
Starting a few generations back, Samsung released a new feature for their top-end smartphones called DeX. Its goal was to let the phone be used as a small PC when needed. It accomplished this by adding HDMI output through the USB-C connector and changing the Android UI to behave more like a desktop PC while in this mode. At one point, it even supported running Arm Linux as an option, but that feature was recently deprecated. A clever developer brought some of that functionality back in the form of a user-mode-only virtual environment called UserLand. It is able to run a limited form of Linux that will serve our purpose. The steps to run command-line Linux projects on your Samsung phone are roughly these:

1. Install the UserLand app from the Google Play Store.
2. Create a new Linux session (e.g. Ubuntu) and open its terminal.
3. Install the build tools (e.g. apt install build-essential git).
4. Clone the repo (git clone https://github.com/bitbank2/min_search_arm), build it, and run it.
The test program creates a list of 10 million random integers and returns the index (position) of the lowest value. I chose 10 million to make sure the list did not fit in the CPU cache and so that the program executes long enough to get a reasonably accurate measurement of the execution time. Below are the results I got for each phone. “Scalar” refers to the original C code, “Vector” to the Neon code, and “Vector (unrolled)” to a modified version of the Neon code that manually unrolls the loop once (processing 8 elements per iteration). All versions were compiled with “-O3” for maximum optimization effort on the part of the compiler:
Galaxy S21 (Cortex-X1):
Scalar time = 9804us
Vector time = 2694us
Vector (unrolled) time = 1576us (wow, 6x faster than scalar)

Galaxy S10:
Scalar time = 9696us
Vector time = 2807us
Vector (unrolled) time = 2766us (the compiler had already unrolled the loop above)
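Before digging into these numbers: timings like the ones above can be captured with a simple wall-clock helper; a sketch of the kind of harness involved follows (hypothetical function names; see the repo for the actual code):

#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>

// wall-clock time in microseconds
static int64_t micros(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (int64_t)tv.tv_sec * 1000000 + tv.tv_usec;
}

// usage: time one run of the search (min_index_neon is a hypothetical name)
int64_t t0 = micros();
int iPos = min_index_neon(array, LEN);
printf("Vector time = %ldus (index %d)\n", (long)(micros() - t0), iPos);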
Pretty impressive speed for a phone. With the unrolled Neon code, the X1 was able to find the index of the minimum value in a list of 10 million 32-bit integers in roughly 1.5ms! A few things can be inferred from the results. The key thing is that the X1 in the S21 is clocked at the same rate as the ‘performance’ core in the S10, so the slightly faster scalar result on the S10 indicates that its compiler is more aggressive about unrolling loops. This is backed up by the fact that the simple and unrolled versions of the Neon code execute in basically the same amount of time on the S10. A good comparison of the actual speed difference between the two generations of processors is the unrolled execution time. The X1 and the Kryo 485 are clocked at the same base speed (2.84GHz), though the X1 can boost its clock up to 3.0GHz. Still, the X1 ran the code nearly twice as fast due to its pipeline improvements.
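To illustrate what unrolling the loop once means here, below is a sketch of one way to do it (my reconstruction of the idea, not necessarily the exact code in the repo). Using two independent min/index chains lets the CPU overlap the two halves of each iteration instead of serializing on a single vMin:

// process 8 elements per iteration with two independent accumulator chains
const int32_t s0[] = {0,1,2,3}, s1[] = {4,5,6,7};
int32x4_t vMin0 = vdupq_n_s32(0x7fffffff), vMin1 = vMin0;
int32x4_t vIdx0 = vld1q_s32(s0), vIdx1 = vld1q_s32(s1);
int32x4_t vMinIdx0 = vdupq_n_s32(0), vMinIdx1 = vdupq_n_s32(0);
int32x4_t vInc = vdupq_n_s32(8); // each chain advances by 8 per iteration
for (int i = 0; i < len; i += 8) { // assumes len is a multiple of 8
    int32x4_t vIn0 = vld1q_s32(&array[i]);
    int32x4_t vIn1 = vld1q_s32(&array[i + 4]);
    uint32x4_t vM0 = vcltq_s32(vIn0, vMin0);
    uint32x4_t vM1 = vcltq_s32(vIn1, vMin1);
    vMin0 = vminq_s32(vIn0, vMin0);
    vMin1 = vminq_s32(vIn1, vMin1);
    vMinIdx0 = vbslq_s32(vM0, vIdx0, vMinIdx0);
    vMinIdx1 = vbslq_s32(vM1, vIdx1, vMinIdx1);
    vIdx0 = vaddq_s32(vIdx0, vInc);
    vIdx1 = vaddq_s32(vIdx1, vInc);
}
// merge the two chains, then reduce to a single min + index as before
uint32x4_t vM = vcltq_s32(vMin1, vMin0);
int32x4_t vMinIndices = vbslq_s32(vM, vMinIdx1, vMinIdx0);
int32x4_t vMin = vminq_s32(vMin0, vMin1);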
To conclude, a few takeaways:

- Compilers have limited ability to vectorize loops containing conditional statements; do not assume yours did it for you.
- Neon comparison instructions produce lane masks that let you replace branches with branch-free masking and selection (vand/vbsl), making conditional loops straightforward to vectorize by hand.
- Keeping a vector of running indices makes position searches (min/max index) vectorizable too.
- Manually unrolling the Neon loop can deliver a further large speedup on a wide core like the Cortex-X1, though some compilers will already have done it for you.
- Measure on real hardware: compilers and CPU generations differ more than you might expect.