A case study in vectorizing HACCmk using SVE

November 3, 2022

14 minute read time.

Introduction

With the introduction of Scalable Vector Extensions (SVE) by Arm as an optional extension in ARMv8-2, compiler auto-vectorizers have a choice between optimizing for SVE or Neon. Programmers can influence that choice through the gcc -march compiler flag. For example -march=armv8.2-a+sve enables SVE on Armv8.2-A and -march=armv9-a+nosve disables SVE on Armv9-A.

One important feature distinguishing SVE from Neon is predication applied to each element (lane) of a vector. By using vector predication, SVE can often vectorize loops that Neon could not. Sometimes where a loop can be vectorized using SVE or Neon, the SVE implementation is more effective. For example, SVE predication can eliminate the need for some vector compares and selects that Neon vectorization requires.

A good description of SVE and of these two key properties can be found in the IEEE Micro paper “The Arm Scalable Vector Extension” (Stephens, et. al., 2017)[1]. More detail with examples and comparisons of SVE with Neon are found in the white paper “A sneak peek into SVE and VLA programming” (F. Petrogalli, 2020)[2]. Lastly, application of SVE to machine learning is found in “Arm Scalable Vector Extensions and application to Machine Learning: (D. A. Iliescu and F. Petrogalli, 2018)[3].

This blog describes a case study in vectorizing a hot loop that appears in the HACCmk benchmark.

SVE predication provides more vectorization opportunities

Consider the code for HACCmk which was one of the benchmarks in the US government’s ASC CORAL RFP. It is an n-body code to compute the gravitational forces on one body of a group of n bodies due to the other bodies.

The computational kernel that matters in HACCmk appears in the function GravityForceKernel(…) and is shown below.

for (int i = 0; i < n; ++i) {

float dx = x[i] - x0, dy = y[i] - y0, dz = z[i] - z0;

float r2 = dx * dx + dy * dy + dz * dz;

if (r2 >= MaxSepSqrd || r2 == 0.0f)

continue;

float r2s = r2 + SofteningLenSqrd;

float f = PolyCoefficients[PolyOrder];

for (int p = 1; p <= PolyOrder; ++p)

f = PolyCoefficients[PolyOrder-p] + r2*f;

f = (1.0f / (r2s * std::sqrt(r2s)) - f) * mass[i];

lax += f * dx;

lay += f * dy;

laz += f * dz;

}

ax += lax;

ay += lay;

az += laz;

The bolded if statement skips the force computation between pairs of bodies that are a large distance apart (force assumed to be negligible) or a body with itself. This action is a way of pruning the numbers of calculations required to speed up execution time at the expense of some precision.

In terms of loop vectorization, conditional statements inside a loop often prevent vectorization from occurring. In certain simple cases compilers can perform an if-conversion to allow the resulting loop to vectorize. If-conversion typically computes results for both the taken and non-taken paths and uses a conditional select instruction instead of having a branch, however this result is not always possible. Other times, it is possible but deemed suboptimal compared to generating non-vector code.

In this HACCmk kernel, if-conversion was deemed not beneficial by the compiler. Likely because the computation is expensive and has multiple variables which require conditional selects for each. Branching around the force computation when it is not required was deemed higher performance. As a result, the loop is not vectorizable using Neon. We can confirm this condition with the -fopt-vec-info-missed flag of gcc which prints information about vectorization attempts that failed. In this case, it gives the following reason,

<source>:21:23: missed: couldn't vectorize loop

<source>:21:23: missed: not vectorized: control flow in loop.

This code is a good example where predication in SVE can increase vectorization opportunities. Predication allows the conditional statement to be handled within a vector on a per-element basis. In other words, with SVE a predicate vector can be calculated that specifies which vector elements are updated with new force calculations and which are left unchanged.

The following code generated by gcc 12.1 when compiling for SVE and Neon. The Neon compilation was unable to vectorize it and produced scalar code).

Code compiled for Neon

(-Ofast -march=armv8.6-a+simd)

https://godbolt.org/z/98Tv5aEbo

Code compiled for SVE

(-Ofast -march=armv8.6-a+sve)

https://godbolt.org/z/hKrsG8j8b

.L4:

ldr s16, [x2, x0]

ldr s17, [x1, x0]

ldr s7, [x3, x0]

fsub s16, s16, s1

fsub s17, s17, s0

fsub s7, s7, s2

fmul s5, s16, s16

fmadd s5, s17, s17, s5

fmadd s5, s7, s7, s5

fmadd s18, s5, s22, s21

fcmp s5, #0.0

fadd s6, s5, s4

fccmpe s5, s3, 0, ne

fmadd s18, s5, s18, s20

bge .L3

fsqrt s29, s6

fmadd s18, s5, s18, s26

ldr s28, [x4, x0]

fmsub s5, s5, s18, s25

fmul s6, s6, s29

fdiv s6, s27, s6

fadd s5, s6, s5

fmul s5, s5, s28

fmadd s24, s17, s5, s24

fmadd s23, s16, s5, s23

fmadd s19, s7, s5, s19

.L3:

add x0, x0, 4

cmp x8, x0

bne .L4

.L3:

ld1w z3.s, p1/z, [x2, x8, lsl 2]

ld1w z4.s, p1/z, [x1, x8, lsl 2]

ld1w z2.s, p1/z, [x3, x8, lsl 2]

fsub z3.s, z3.s, z20.s

fsub z4.s, z4.s, z21.s

fmul z0.s, z3.s, z3.s

fsub z2.s, z2.s, z19.s

fmla z0.s, p0/m, z4.s, z4.s

fmla z0.s, p0/m, z2.s, z2.s

fadd z1.s, z0.s, z17.s

fcmeq p3.s, p0/z, z0.s, #0.0

movprfx z28, z1

fsqrt z28.s, p0/m, z1.s

fmul z1.s, z1.s, z28.s

fdivr z1.s, p0/m, z1.s, z22.s

fadd z1.s, z1.s, z23.s

movprfx z5, z26

fmla z5.s, p0/m, z0.s, z27.s

fcmge p2.s, p0/z, z0.s, z18.s

fmad z5.s, p0/m, z0.s, z25.s

nor p2.b, p0/z, p2.b, p3.b

fmad z5.s, p0/m, z0.s, z24.s

and p3.b, p1/z, p2.b, p2.b

fmsb z0.s, p0/m, z5.s, z1.s

ld1w z1.s, p3/z, [x4, x8, lsl 2]

add x8, x8, x9

fmul z0.s, z0.s, z1.s

movprfx z4.s, p2/z, z4.s

fmul z4.s, p2/m, z4.s, z0.s

movprfx z3.s, p2/z, z3.s

fmul z3.s, p2/m, z3.s, z0.s

fadd z16.s, p1/m, z16.s, z4.s

fadd z7.s, p1/m, z7.s, z3.s

movprfx z2.s, p2/z, z2.s

fmul z2.s, p2/m, z2.s, z0.s

fadd z6.s, p1/m, z6.s, z2.s

whilelo p1.s, w8, w0

b.any .L3

faddv s6, p0, z6.s

faddv s7, p0, z7.s

faddv s16, p0, z16.s

The bolded text in the assembly listings indicate the code associated with the conditional statement.

In the scalar code, the compare with 0.0 (fcmp) and the later conditional compare (fccmpe) test the two conditions that execute the continue statement and skip the rest of the loop body (bge .L3) if either is true.

For the SVE case, the predicate registers p2 and p3 handle the two conditions for skipping the force computation, one for each condition. These results are then nor-ed together into p2 that controls which vector elements have the force calculation done for them and which vector elements are left unmodified.

Getting Neon to vectorize

While the Neon compilation failed to vectorize this loop due to control flow in it, this is not always the case. In this code, the continue statement functions as a goto back to the top of the loop. Sometimes the compiler can use if-conversion to change a control dependency into a data dependency and then vectorize the loop.

Sometimes, if-conversion changes a compare and branch sequence into a conditional select of two values based on the original condition. In other cases, a compare and branch sequence is replaced by masking operations that either modify a variable or leave it unchanged.

For this code, if-conversion entails doing the force calculation in every loop iteration. It then uses a mask to either add the calculated value or a zero to lax, lay, and laz at the bottom of the loop.

Such a rewrite results in performing some floating-point calculations that would not have been done in the original code. The compiler has no way to know if these additional floating-point operations might cause exceptions that would not have occurred in the original code. In gcc, such optimizations are only done if -fno-trapping-math is used, which is included in -Ofast for gcc. So under -Ofast, gcc is allowed to do such a rewrite but did not, either because it thought it unprofitable or failed to see the opportunity.

However, rewriting the loop to explicitly do if-conversion by hand in the source code can coax the compiler to vectorize it with Advanced SIMD.

for (int i = 0; i < n; ++i) {

float dx = x[i] - x0, dy = y[i] - y0, dz = z[i] - z0;

float r2 = dx * dx + dy * dy + dz * dz;

int mask, tmp1;

// remove if statement and continue statement

float r2s = r2 + SofteningLenSqrd;

float f = PolyCoefficients[PolyOrder];

for (int p = 1; p <= PolyOrder; ++p)

f = PolyCoefficients[PolyOrder-p] + r2*f;

f = (1.0f / (r2s * std::sqrt(r2s)) - f) * mass[i];

// calculate mask based on condition being T/F

mask = (r2 >= MaxSepSqrd || r2 == 0.0f);

// tmp1 assigned either f or 0 based on mask

tmp1 = (*(unsigned int *)&f) & mask; // make f look like int

lax += (*(float *) &tmp1) * dx; // make tmp1 look like float

lay += (*(float *) &tmp1) * dy; // make tmp1 look like float

laz += (*(float *) &tmp1) * dz; // make tmp1 look like float

}

ax += lax;

ay += lay;

az += laz;

Here the if and continue statements are gone and the values added to lax. lay, and laz will be 0 or the calculated force value, based on the same condition previously used for the continue. The value being added is determined by the mask variable.

This results in the following Neon vector code being generated.

Code compiled for Neon after rewrite

(-Ofast -march=armv8.6-a+simd)

https://godbolt.org/z/77Pb1d6b6

movi v22.4s, 0x1

.L4:

ldr q18, [x2, x8]

ldr q17, [x1, x8]

fsub v18.4s, v18.4s, v8.4s

ldr q16, [x3, x8]

fsub v17.4s, v17.4s, v9.4s

fmul v5.4s, v18.4s, v18.4s

fsub v16.4s, v16.4s, v31.4s

mov v14.16b, v27.16b

mov v13.16b, v26.16b

fmla v5.4s, v17.4s, v17.4s

ldr q10, [x4, x8]

add x8, x8, 16

fmla v5.4s, v16.4s, v16.4s

fadd v12.4s, v30.4s, v5.4s

fcmeq v11.4s, v5.4s, 0

fmla v14.4s, v5.4s, v28.4s

fcmge v7.4s, v5.4s, v29.4s

fsqrt v6.4s, v12.4s

fmla v13.4s, v14.4s, v5.4s

orr v7.16b, v7.16b, v11.16b

mov v11.16b, v25.16b

and v7.16b, v22.16b, v7.16b

fmla v11.4s, v13.4s, v5.4s

fmul v6.4s, v6.4s, v12.4s

fdiv v6.4s, v24.4s, v6.4s

fadd v6.4s, v6.4s, v23.4s

fmls v6.4s, v11.4s, v5.4s

fmul v5.4s, v10.4s, v6.4s

and v5.16b, v7.16b, v5.16b

fmla v20.4s, v18.4s, v5.4s

fmla v21.4s, v17.4s, v5.4s

fmla v19.4s, v16.4s, v5.4s

cmp x8, x9

bne .L4

The force calculation is always done (even for far objects and an object with itself). In other words, the computation is no longer pruned to approximate and speed up the solution. However, the calculated value is effectively discarded (replaced by zero) when the pruning conditions are met. This is accomplished by the two compares (fcmgt and fcmeq) followed by the orr, two and instructions, and two converts between float and integer. Together these instructions zero out any vector elements where the pruning conditions were met. Then the last three fmla instructions updates laz, lay, and laz.

Performance comparison of these three executables

The performance of these three versions was evaluated using an internal cycle accurate simulator. The CPU core modeled was a Neoverse V1 and the same input dataset was used in all runs. The Neoverse V1[4] core has two 256-bit wide SVE execution units (2x256) and four 128-bit wide Advanced SIMD execution units (4x128).

For Neoverse V1, one would need twice as many Neon instructions as SVE instructions to perform a fixed amount of vectorized work. The total vector bandwidth (512 vector bits / cycle) is the same in Neoverse V1 for SVE and Neon.

The following table shows the execution times in cycles and other statistics for each of HACCmk binaries:

	Original (scalar)	Vectorized Neon	Vectorized SVE
CPU_CYCLES	31,678,652	9,775,068	8,856,978
INST_RETIRED	74,230,037	23,740,887	13,390,886
FLOPs[5]/iteration (hot loop)	28 (27.33 avg)	112	224
# bytes loaded / iteration (hot loop)	16	64	128
# iterations (hot loop)	2,477,008	618,439	310,591
IPC	2.343	2.429	1.512

The Neoverse V1 can accomplish the same amount of floating-point work using two 256-bit SVE instructions per cycle or four 128-bit Neon instructions per cycle. This is a case where IPC comparisons can mislead because each SVE instruction does the work of two Neon instructions.

The simulator used can also provide execution counts per instruction address. This provides the number of iterations that the hot loop in each binary was executed. The floating-point operations (FLOPs) per iteration is calculated by examination of the disassembly. The original scalar code has 28 FLOPs in the hot loop if a static analysis is done. But due to part of the loop sometimes getting pruned (4.5% of iterations for this input dataset), the dynamic FLOPs per iteration works out to be 27.33. Multiplying the FLOPs per iteration by number of iterations shows that each binary is doing the same total amount of FP work[6].

Vectorizing the original scalar code to use Neon reduced the number of instructions needed by 65%[7]. This despite any extra instructions executed due to the vector Neon version no longer pruning the computation for very far objects or objects with themselves. Doing some wasted work and discarding the results was still beneficial since the Neon vector code reduced execution cycles by 63% from the original scalar code.

The SVE version retained the computation pruning (using predication) of the algorithm and performed a further 26% faster than the vectorized Neon version. While pruning the computation through predication likely has minimal impact on the number of instructions executed, using SVE provided a slightly different mix of instructions and resulted in fewer and shorter data dependency chains and improved instruction flow.

Having detailed cycle by cycle simulation output allows comparison of the fraction of execution cycles spent in the hot loops for each executable. The simulator provides the number of times each instruction was executed and how many cycles it waited to retire after becoming the oldest instruction (program order) in the machine. The following stats are based on these counts.

	SVE		Original (scalar)		Vectorized Neon
	Insts	Cycles	Insts	Cycles	Insts	Cycles
hot loop	12,113,049	8,390,976	72,973,140	31,236,139	24,737,560	11,245,318
entire run	13,390,886	8,856,971	74,230,037	31,678,470	26,182,871	11,734,913
Loop % of total	90.5%	94.7%	98.3%	98.6%	94.5%	95.8%

This shows approximately 95% or more of the total cycles executed were in the hot loop for all cases.

In the case or the original scalar code, the force computation is pruned when the objects are far apart. The detailed simulator output showed that 4.5% of the loop iterations were pruned (that is, 95.5% of the iterations did the force computation). The cycles spent doing force computation comprised 93% of all loop cycles. For the input data used, there was not enough pruning in the original code to outweigh the gains from vectorizing with Neon code despite some wasted computation.

Comparing the two vectorized versions (Neon and SVE) reveals the likely reason SVE outperformed Neon. Categorizing the instructions in the two hot loops shows:

SVE		Neon
LD1W	4	LDR	4
FSUB,FADD,FMUL,FMSB,FMLA,FMAD	21	FSUB,FADD,FMUL,FMLS,FMLA	21
FSQRT, FDIV	2	FSQRT, FDIV	2
FCMEQ, FCMGT	2	FCMEQ, FCMGT	2
MOV	1	MOV	4
MOVPRFX	5
BIC, AND, NOR	1	BIC, AND, NOR	4
ADD	1	ADD	1
WHILELO, BCC	2	CMP, BCC	2
total	39	total	40

The instruction mixes are almost the same except for differences in MOV, MOVPRFX, and the logical instructions BIC, AND, NOR. The SVE code uses a NOR to set certain predicate register bits while the Neon code uses BIC and three ANDs to mask off the vector elements that should not be modified.

In the Neon version, MOVs are used to make copies of registers that must be preserved across iterations. For SVE, the MOVPRFX provides this functionality by telling the hardware that the immediately following instruction can be converted by hardware from a destructive operation (like FMLA) into a constructive operation (like FMADD). It is only a hint, and the hardware can choose whether to treat it as a MOV or to convert it and emit micro-ops for a constructive operation. Doing so preserves the source of the MOV without the need for an explicit MOV. This conversion would typically be done in the machine front end during micro-op generation.

The extra logical instructions for Neon (BIC and AND) add more instructions and pressure to the machine as do the MOVs compared to this SVE implementation. For SVE, using MOVPRFX hints and per-element predication allows for fewer instructions. Combined, these features can eliminate a cycle or two from each loop iteration which adds up in such a hot loop.

Summary

HACCmk illustrates how SVE can vectorize loops that typically do not vectorize or are very difficult to vectorize without per-lane predication. In the example shown here, the hot loop of HACCmk was only vectorized using Neon after a rewrite knowing why vectorization failed and how to coax the compiler into vectorizing. Programmers who are not familiar with compiler internals get the benefit of vectorization in more cases with SVE without needing compiler expert rewrites.

[CTAToken URL = "https://developer.arm.com/-/media/Arm%20Developer%20Community/Images/White%20Paper%20and%20Webinar%20Images/HPC%20White%20Papers/a-sneak-peek-into-sve-and-vla-programming.pdf" target="_blank" text="Download SVE Whitepaper" class ="green"]

[1] https://alastairreid.github.io/papers/sve-ieee-micro-2017.pdf

[2] https://developer.arm.com/solutions/hpc/resources/hpc-white-papers/a-sneak-peek-into-sve-and-vla-programming

[3] https://developer.arm.com/solutions/hpc/resources/hpc-white-papers/arm-scalable-vector-extensions-and-application-to-machine-learning

[4] An overview of neoverse-v1 can be found at https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/neoverse-v1-platform-a-new-performance-tier-for-arm

[5] FP add, subtract, or multiply count as 1 FLOP while FP multiply-add, multiply-subtract, or multiply-accumulate count as 2 FLOPs.

[6] There can be small differences in total FLOP count due to loop prologue and epilogue considerations.

[7] Recall, 1 Neon instruction can do the work of 4 scalar instructions.

Parents

Karol Piatek over 1 year ago

Brian, brilliant article,

trying some of those proposals on Fedora 36 on AWS Graviton 3 for cloudflare's/aws zlib.

there is small thing, gcc option is named -fopt-info-missed-vec,
order of parameters different. :)

Thanks for sharing,
Karol
- Cancel
- Up 0 Down
- Reply
- More
- Cancel

Comment

Karol Piatek over 1 year ago

Brian, brilliant article,

trying some of those proposals on Fedora 36 on AWS Graviton 3 for cloudflare's/aws zlib.

there is small thing, gcc option is named -fopt-info-missed-vec,
order of parameters different. :)

Thanks for sharing,
Karol
- Cancel
- Up 0 Down
- Reply
- More
- Cancel

Children

No Data

High Performance Computing (HPC) blog

Expanding Arm on Arm with the NVIDIA Grace CPU

Tim Thornton

In this blog post, we show how the Arm Neoverse V2-based NVIDIA Grace CPU can run Arm's most performance-critical workloads and allows Arm to operate a consistent environment in-cloud and on-prem.
- November 20, 2024
Arm Performance Libraries 24.10

Chris Goodyer

In this blog post, we review the improvements made to Arm Performance Libraries 24.10.
- November 11, 2024
Optimizing the Pardiso Sparse Linear Solver on Arm Architecture by Panua Technologies: A Performance Comparison with Intel MKL

David Lecomber

This blog outlines the strategies utilized to enhance Pardiso's performance by leveraging the Arm architecture and presents a comparative study with Intel MKL Pardiso.
- October 22, 2024

AI and ML blog

Announcements

Architectures and Processors blog

Automotive blog

Embedded blog

Graphics, Gaming, and VR blog

High Performance Computing (HPC) blog

Infrastructure Solutions blog

Internet of Things (IoT) blog

Operating Systems blog

SoC Design and Simulation blog

Tools, Software and IDEs blog

A case study in vectorizing HACCmk using SVE

Introduction

SVE predication provides more vectorization opportunities

Getting Neon to vectorize

Performance comparison of these three executables

Summary

Expanding Arm on Arm with the NVIDIA Grace CPU

Arm Performance Libraries 24.10

Optimizing the Pardiso Sparse Linear Solver on Arm Architecture by Panua Technologies: A Performance Comparison with Intel MKL