With the introduction of the Scalable Vector Extension (SVE) by Arm as an optional extension in Armv8.2-A, compiler auto-vectorizers have a choice between optimizing for SVE or for Neon. Programmers can influence that choice through the gcc -march compiler flag. For example, -march=armv8.2-a+sve enables SVE on Armv8.2-A and -march=armv9-a+nosve disables SVE on Armv9-A.
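As a quick way to confirm what a given set of flags enables, the ACLE feature macro __ARM_FEATURE_SVE is defined whenever the compiler is generating SVE code. A minimal probe (a sketch, not from HACCmk) looks like this:

#include <stdio.h>

int main(void) {
    // __ARM_FEATURE_SVE is defined by ACLE-conforming compilers when SVE
    // code generation is enabled (for example, -march=armv8.2-a+sve).
#ifdef __ARM_FEATURE_SVE
    printf("SVE enabled for this compilation\n");
#else
    printf("SVE not enabled for this compilation\n");
#endif
    return 0;
}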
One important feature distinguishing SVE from Neon is predication, applied to each element (lane) of a vector. By using vector predication, SVE can often vectorize loops that Neon cannot. And where a loop can be vectorized using either SVE or Neon, the SVE implementation is sometimes more efficient. For example, SVE predication can eliminate some of the vector compares and selects that Neon vectorization requires.
A good description of SVE and of these two key properties can be found in the IEEE Micro paper “The Arm Scalable Vector Extension” (Stephens et al., 2017)[1]. More detail, with examples and comparisons of SVE with Neon, is found in the white paper “A sneak peek into SVE and VLA programming” (F. Petrogalli, 2020)[2]. Lastly, the application of SVE to machine learning is covered in “Arm Scalable Vector Extensions and application to Machine Learning” (D. A. Iliescu and F. Petrogalli, 2018)[3].
This blog describes a case study in vectorizing a hot loop that appears in the HACCmk benchmark.
Consider the code for HACCmk, which was one of the benchmarks in the US government’s ASC CORAL RFP. It is an n-body code that computes the gravitational force on one body in a group of n bodies due to all the other bodies.
The computational kernel that matters in HACCmk appears in the function GravityForceKernel(…) and is shown below.
for (int i = 0; i < n; ++i) {
    float dx = x[i] - x0, dy = y[i] - y0, dz = z[i] - z0;
    float r2 = dx * dx + dy * dy + dz * dz;

    if (r2 >= MaxSepSqrd || r2 == 0.0f)
        continue;

    float r2s = r2 + SofteningLenSqrd;
    float f = PolyCoefficients[PolyOrder];
    for (int p = 1; p <= PolyOrder; ++p)
        f = PolyCoefficients[PolyOrder-p] + r2*f;
    f = (1.0f / (r2s * std::sqrt(r2s)) - f) * mass[i];

    lax += f * dx;
    lay += f * dy;
    laz += f * dz;
}
ax += lax;
ay += lay;
az += laz;
The if statement skips the force computation between pairs of bodies that are far apart (where the force is assumed negligible), or between a body and itself. This pruning reduces the number of calculations required, speeding up execution at the expense of some precision.
In terms of loop vectorization, conditional statements inside a loop often prevent vectorization. In certain simple cases, compilers can perform if-conversion to allow the resulting loop to vectorize. If-conversion typically computes results for both the taken and not-taken paths and uses a conditional select instruction in place of a branch; however, this transformation is not always possible. Other times it is possible, but deemed suboptimal compared to generating non-vector code.
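As a small illustration (hypothetical code, not from HACCmk), consider summing only the array elements above a threshold. If-conversion replaces the branch with a per-element select:

// Branchy form: control flow inside the loop.
float sum_above(const float *a, int n, float t) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i) {
        if (a[i] > t)   // control dependency
            s += a[i];
    }
    return s;
}

// If-converted form: both paths computed, the result selected per
// element. This maps to a vector compare plus a conditional select.
float sum_above_ifconverted(const float *a, int n, float t) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += (a[i] > t) ? a[i] : 0.0f;  // data dependency
    return s;
}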
In this HACCmk kernel, if-conversion was deemed not beneficial by the compiler, likely because the computation is expensive and involves multiple variables, each of which would require its own conditional select. Branching around the force computation when it is not required was judged to be faster. As a result, the loop is not vectorizable using Neon. We can confirm this with gcc's -fopt-info-vec-missed flag, which prints information about vectorization attempts that failed. In this case, it gives the following reason:
<source>:21:23: missed: couldn't vectorize loop
<source>:21:23: missed: not vectorized: control flow in loop.
This code is a good example of how predication in SVE can increase vectorization opportunities. Predication allows the conditional statement to be handled within a vector on a per-element basis. In other words, with SVE a predicate vector can be calculated that specifies which vector elements are updated with new force calculations and which are left unchanged.
The following code was generated by gcc 12.1 when compiling for SVE and for Neon. The Neon compilation was unable to vectorize the loop and produced scalar code.
The bolded text in the assembly listings indicates the code associated with the conditional statement.
In the scalar code, the compare with 0.0 (fcmp) and the later conditional compare (fccmpe) test the two conditions that trigger the continue statement, skipping the rest of the loop body (bge .L3) if either is true.
For the SVE case, the predicate registers p2 and p3 handle the two conditions for skipping the force computation, one for each condition. These results are then NORed together into p2, which controls which vector elements have the force calculation done for them and which are left unmodified.
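To make this concrete in source form, here is a hand-written sketch of the kernel using the SVE ACLE intrinsics (the function name gravity_force_sve is made up for this sketch, and the compiler's auto-vectorized output differs in its details):

#include <arm_sve.h>

void gravity_force_sve(int n, const float *x, const float *y, const float *z,
                       const float *mass, float x0, float y0, float z0,
                       float MaxSepSqrd, float SofteningLenSqrd,
                       const float *PolyCoefficients, int PolyOrder,
                       float *ax, float *ay, float *az) {
    svfloat32_t lax = svdup_f32(0.0f);
    svfloat32_t lay = svdup_f32(0.0f);
    svfloat32_t laz = svdup_f32(0.0f);
    for (int i = 0; i < n; i += (int)svcntw()) {
        svbool_t pg = svwhilelt_b32(i, n);  // active lanes for this pass
        svfloat32_t dx = svsub_x(pg, svld1(pg, &x[i]), svdup_f32(x0));
        svfloat32_t dy = svsub_x(pg, svld1(pg, &y[i]), svdup_f32(y0));
        svfloat32_t dz = svsub_x(pg, svld1(pg, &z[i]), svdup_f32(z0));
        svfloat32_t r2 = svmul_x(pg, dx, dx);
        r2 = svmla_x(pg, r2, dy, dy);
        r2 = svmla_x(pg, r2, dz, dz);
        // The two pruning conditions, NORed together: a lane stays
        // active only if it is neither too far away nor the body itself.
        svbool_t far  = svcmpge(pg, r2, svdup_f32(MaxSepSqrd));
        svbool_t self = svcmpeq(pg, r2, svdup_f32(0.0f));
        svbool_t keep = svnor_z(pg, far, self);
        svfloat32_t r2s = svadd_x(pg, r2, svdup_f32(SofteningLenSqrd));
        svfloat32_t f = svdup_f32(PolyCoefficients[PolyOrder]);
        for (int p = 1; p <= PolyOrder; ++p)
            f = svmla_x(pg, svdup_f32(PolyCoefficients[PolyOrder - p]), r2, f);
        f = svsub_x(pg, svdiv_x(pg, svdup_f32(1.0f),
                                svmul_x(pg, r2s, svsqrt_x(pg, r2s))), f);
        f = svmul_x(pg, f, svld1(pg, &mass[i]));
        // Predicated accumulate: pruned lanes are left unchanged.
        lax = svmla_m(keep, lax, f, dx);
        lay = svmla_m(keep, lay, f, dy);
        laz = svmla_m(keep, laz, f, dz);
    }
    *ax += svaddv(svptrue_b32(), lax);
    *ay += svaddv(svptrue_b32(), lay);
    *az += svaddv(svptrue_b32(), laz);
}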
While the Neon compilation failed to vectorize this loop due to the control flow in it, such a failure is not inevitable. In this code, the continue statement functions as a goto back to the top of the loop. Sometimes the compiler can use if-conversion to change a control dependency into a data dependency and then vectorize the loop.
Sometimes, if-conversion changes a compare and branch sequence into a conditional select of two values based on the original condition. In other cases, a compare and branch sequence is replaced by masking operations that either modify a variable or leave it unchanged.
For this code, if-conversion entails doing the force calculation in every loop iteration. It then uses a mask to add either the calculated value or zero to lax, lay, and laz at the bottom of the loop.
Such a rewrite results in performing some floating-point calculations that would not have been done in the original code. The compiler has no way to know whether these additional floating-point operations might cause exceptions that would not have occurred in the original code. In gcc, such optimizations are only done if -fno-trapping-math is used, which -Ofast implies. So under -Ofast, gcc is allowed to perform such a rewrite but did not, either because it judged it unprofitable or failed to see the opportunity.
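As a distilled illustration of the concern (hypothetical code, not the compiler's view of the actual kernel): the compiler cannot prove that r2s is positive for the elements the original code skipped, so speculating the divide and square root could raise exceptions such as FE_DIVBYZERO or FE_INVALID that the branchy code never would:

#include <math.h>

// Under the default -ftrapping-math, the compiler must assume this
// expression can trap, so it will not execute it speculatively for
// elements the original code branched around.
float force_term(float r2, float soften) {
    float r2s = r2 + soften;           // not provably positive
    return 1.0f / (r2s * sqrtf(r2s));  // could divide by zero or take
                                       // the square root of a negative
}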
However, rewriting the loop to perform the if-conversion by hand in the source code can coax the compiler into vectorizing it with Neon (Advanced SIMD).
int mask, tmp1;
// Remove the if and continue statements: build an all-zeros or all-ones
// mask from the pruning condition (0 when pruned, ~0 otherwise). The
// "- 1" turns the 0/1 result of the condition into that bitmask.
mask = (r2 >= MaxSepSqrd || r2 == 0.0f) - 1;
// tmp1 is assigned either the bits of f or 0, based on the mask
tmp1 = (*(unsigned int *)&f) & mask; // make f look like an int
lax += (*(float *) &tmp1) * dx;      // make tmp1 look like a float
lay += (*(float *) &tmp1) * dy;
laz += (*(float *) &tmp1) * dz;
Here the if and continue statements are gone, and the values added to lax, lay, and laz will be either 0 or the calculated force value, based on the same condition previously used for the continue. The value being added is selected by the mask variable.
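One caveat: the pointer casts in this rewrite technically violate the C/C++ strict-aliasing rules. A memcpy-based version (a sketch using the same variables) expresses the same bit-level selection portably, and compilers optimize the memcpy calls away:

#include <string.h>

unsigned int fbits, tmp1, mask;
float fsel;
// 0 when the pruning condition holds, all ones otherwise
mask = (r2 >= MaxSepSqrd || r2 == 0.0f) - 1u;
memcpy(&fbits, &f, sizeof fbits);   // view f's bits as an integer
tmp1 = fbits & mask;                // keep f's bits or clear them all
memcpy(&fsel, &tmp1, sizeof fsel);  // view the result as a float again
lax += fsel * dx;
lay += fsel * dy;
laz += fsel * dz;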
The hand-rewritten loop results in the following Neon vector code being generated.
movi v22.4s, 0x1
.L4:
ldr q18, [x2, x8]
ldr q17, [x1, x8]
fsub v18.4s, v18.4s, v8.4s
ldr q16, [x3, x8]
fsub v17.4s, v17.4s, v9.4s
fmul v5.4s, v18.4s, v18.4s
fsub v16.4s, v16.4s, v31.4s
mov v14.16b, v27.16b
mov v13.16b, v26.16b
fmla v5.4s, v17.4s, v17.4s
ldr q10, [x4, x8]
add x8, x8, 16
fmla v5.4s, v16.4s, v16.4s
fadd v12.4s, v30.4s, v5.4s
fcmeq v11.4s, v5.4s, 0
fmla v14.4s, v5.4s, v28.4s
fcmge v7.4s, v5.4s, v29.4s
fsqrt v6.4s, v12.4s
fmla v13.4s, v14.4s, v5.4s
orr v7.16b, v7.16b, v11.16b
mov v11.16b, v25.16b
and v7.16b, v22.16b, v7.16b
fmla v11.4s, v13.4s, v5.4s
fmul v6.4s, v6.4s, v12.4s
fdiv v6.4s, v24.4s, v6.4s
fadd v6.4s, v6.4s, v23.4s
fmls v6.4s, v11.4s, v5.4s
fmul v5.4s, v10.4s, v6.4s
and v5.16b, v7.16b, v5.16b
fmla v20.4s, v18.4s, v5.4s
fmla v21.4s, v17.4s, v5.4s
fmla v19.4s, v16.4s, v5.4s
cmp x8, x9
bne .L4
The force calculation is always done (even for far-apart bodies and for a body with itself). In other words, the computation is no longer pruned to approximate and speed up the solution. However, the calculated value is effectively discarded (replaced by zero) when the pruning conditions are met. This is accomplished by the two compares (fcmge and fcmeq) followed by the orr and the and instructions; the float/integer reinterpretations in the source need no instructions of their own, since the bitwise operations act directly on the vector registers. Together these instructions zero out any vector elements where the pruning conditions were met. Then the last three fmla instructions update lax, lay, and laz.
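The same masking is easy to see in Neon intrinsics. The following is a conceptual sketch (the helper name masked_accumulate is made up here) of the compare, bit-clear, and accumulate steps:

#include <arm_neon.h>

// Accumulate acc += f * d per lane, but first zero f in any lane where
// the pruning condition (r2 >= MaxSepSqrd or r2 == 0) holds.
static inline float32x4_t masked_accumulate(float32x4_t acc, float32x4_t f,
                                            float32x4_t d, float32x4_t r2,
                                            float32x4_t max_sep_sqrd) {
    uint32x4_t prune = vorrq_u32(vcgeq_f32(r2, max_sep_sqrd),
                                 vceqq_f32(r2, vdupq_n_f32(0.0f)));
    // clear f's bits in pruned lanes: f & ~prune
    float32x4_t fsel = vreinterpretq_f32_u32(
        vbicq_u32(vreinterpretq_u32_f32(f), prune));
    return vfmaq_f32(acc, fsel, d);
}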
The performance of these three versions was evaluated using an internal cycle-accurate simulator. The CPU core modeled was a Neoverse V1 and the same input dataset was used in all runs. The Neoverse V1[4] core has two 256-bit wide SVE execution units (2x256) and four 128-bit wide Advanced SIMD execution units (4x128).
For Neoverse V1, one would need twice as many Neon instructions as SVE instructions to perform a fixed amount of vectorized work. The total vector bandwidth (512 vector bits / cycle) is the same in Neoverse V1 for SVE and Neon.
The following table shows the execution times in cycles and other statistics for each of the HACCmk binaries:
The Neoverse V1 can accomplish the same amount of floating-point work using two 256-bit SVE instructions per cycle or four 128-bit Neon instructions per cycle. This is a case where IPC comparisons can mislead because each SVE instruction does the work of two Neon instructions.
The simulator used can also provide execution counts per instruction address. This gives the number of iterations of the hot loop executed in each binary. The floating-point operations (FLOPs) per iteration are calculated by examination of the disassembly. The original scalar code has 28 FLOPs in the hot loop according to a static count. But because part of the loop is sometimes pruned (4.5% of iterations for this input dataset), the dynamic FLOPs per iteration work out to 27.33. Multiplying the FLOPs per iteration by the number of iterations shows that each binary does the same total amount of FP work[6].
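As a rough consistency check on those figures: if each pruned iteration executes k FLOPs, then 0.955 × 28 + 0.045 × k = 27.33, giving k ≈ 13. In other words, the pruned path appears to account for roughly 13 FLOPs per pruned iteration; this figure is inferred from the numbers above rather than taken from the simulator output.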
Vectorizing the original scalar code with Neon reduced the number of instructions needed by 65%[7]. This is despite the extra instructions executed because the Neon vector version no longer prunes the computation for very distant bodies or for a body with itself. Doing some wasted work and discarding the results was still beneficial: the Neon vector code reduced execution cycles by 63% relative to the original scalar code.
The SVE version retained the algorithm's computation pruning (using predication) and ran a further 26% faster than the vectorized Neon version. While pruning the computation through predication likely has minimal impact on the number of instructions executed, using SVE provided a slightly different mix of instructions, resulting in fewer and shorter data-dependency chains and improved instruction flow.
Having detailed cycle-by-cycle simulation output allows comparison of the fraction of execution cycles spent in the hot loop of each executable. The simulator reports the number of times each instruction was executed and how many cycles it waited to retire after becoming the oldest instruction (in program order) in the machine. The following statistics are based on these counts.
This shows that approximately 95% or more of the total cycles were spent in the hot loop in all cases.
In the case of the original scalar code, the force computation is pruned when the bodies are far apart. The detailed simulator output showed that 4.5% of the loop iterations were pruned (that is, 95.5% of the iterations did the force computation). The cycles spent doing the force computation comprised 93% of all loop cycles. For the input data used, the pruning in the original code was not enough to outweigh the gains from vectorizing with Neon, despite the wasted computation.
Comparing the two vectorized versions (Neon and SVE) reveals the likely reason SVE outperformed Neon. Categorizing the instructions in the two hot loops shows:
The instruction mixes are almost the same except for differences in MOV, MOVPRFX, and the logical instructions BIC, AND, and NOR. The SVE code uses a NOR to set certain predicate register bits, while the Neon code uses a BIC and three ANDs to mask off the vector elements that should not be modified.
In the Neon version, MOVs are used to make copies of registers that must be preserved across iterations. For SVE, MOVPRFX provides this functionality by telling the hardware that it and the immediately following instruction can be combined, converting a destructive operation (like FMLA) into a constructive one (like FMADD). It is only a hint: the hardware can either treat the MOVPRFX as a plain MOV or fuse the pair and emit micro-ops for a constructive operation. Fusing preserves the source register without the need for an explicit MOV, and would typically be done in the machine's front end during micro-op generation.
The extra logical instructions in the Neon code (BIC and AND) add instructions and pressure to the machine, as do the MOVs, compared to the SVE implementation. For SVE, MOVPRFX hints and per-element predication allow for fewer instructions. Combined, these features can eliminate a cycle or two from each loop iteration, which adds up in such a hot loop.
HACCmk illustrates how SVE can vectorize loops that typically do not vectorize, or are very difficult to vectorize, without per-lane predication. In the example shown here, the hot loop of HACCmk was only vectorized using Neon after a hand rewrite that required knowing why vectorization failed and how to coax the compiler. With SVE, programmers who are not familiar with compiler internals get the benefit of vectorization in more cases, without needing expert rewrites.
Download the SVE whitepaper: https://developer.arm.com/-/media/Arm%20Developer%20Community/Images/White%20Paper%20and%20Webinar%20Images/HPC%20White%20Papers/a-sneak-peek-into-sve-and-vla-programming.pdf