A case study in vectorizing HACCmk using SVE

Brian Waldecker
November 3, 2022

Introduction

With the introduction of the Scalable Vector Extension (SVE), an optional extension from Armv8.2-A onward, compiler auto-vectorizers have a choice between optimizing for SVE or Neon. Programmers can influence that choice through the gcc -march compiler flag. For example, -march=armv8.2-a+sve enables SVE on Armv8.2-A and -march=armv9-a+nosve disables SVE on Armv9-A.
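As a hypothetical invocation (the source file name is ours), the two choices might be exercised as:

  gcc -Ofast -march=armv8.2-a+sve  -c GravityForceKernel.cc   # let the auto-vectorizer use SVE
  gcc -Ofast -march=armv9-a+nosve  -c GravityForceKernel.cc   # Armv9-A target with SVE disabled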

One important feature distinguishing SVE from Neon is predication applied to each element (lane) of a vector. By using per-lane predication, SVE can often vectorize loops that Neon cannot. Even where a loop can be vectorized using either SVE or Neon, the SVE implementation is sometimes more effective: SVE predication can eliminate the need for some of the vector compares and selects that Neon vectorization requires.

A good description of SVE and of these two key properties can be found in the IEEE Micro paper “The ARM Scalable Vector Extension” (Stephens et al., 2017)[1]. More detail, with examples and comparisons of SVE with Neon, can be found in the white paper “A sneak peek into SVE and VLA programming” (F. Petrogalli, 2020)[2]. Lastly, the application of SVE to machine learning is covered in “Arm Scalable Vector Extensions and application to Machine Learning” (D. A. Iliescu and F. Petrogalli, 2018)[3].

This blog describes a case study in vectorizing a hot loop that appears in the HACCmk benchmark.

SVE predication provides more vectorization opportunities

Consider the code for HACCmk, which was one of the benchmarks in the US government’s ASC CORAL RFP. It is an n-body code that computes the gravitational forces on one body of a group of n bodies due to the other bodies.

The computational kernel that matters in HACCmk appears in the function GravityForceKernel(…) and is shown below.

  for (int i = 0; i < n; ++i) {
    float dx = x[i] - x0, dy = y[i] - y0, dz = z[i] - z0;
    float r2 = dx * dx + dy * dy + dz * dz;
 
    if (r2 >= MaxSepSqrd || r2 == 0.0f)
      continue;
 
    float r2s = r2 + SofteningLenSqrd;
    float f = PolyCoefficients[PolyOrder];
    for (int p = 1; p <= PolyOrder; ++p)
       f = PolyCoefficients[PolyOrder-p] + r2*f;
 
    f = (1.0f / (r2s * std::sqrt(r2s)) - f) * mass[i];
 
    lax += f * dx;
    lay += f * dy;
    laz += f * dz;
  }
 
  ax += lax;
  ay += lay;
  az += laz;

The if statement skips the force computation between pairs of bodies that are a large distance apart (where the force is assumed negligible) and between a body and itself. This pruning reduces the number of calculations required, speeding up execution at the expense of some precision.

In terms of loop vectorization, conditional statements inside a loop often prevent vectorization. In certain simple cases, compilers can perform an if-conversion that allows the resulting loop to vectorize. If-conversion typically computes results for both the taken and not-taken paths and uses a conditional select instruction in place of a branch; however, this is not always possible, and even when it is possible it may be deemed suboptimal compared to generating non-vector code.
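As a minimal sketch of the idea (function and variable names are illustrative, not from HACCmk):

    // Branchy form: the early exit introduces control flow inside the loop.
    void scale_branchy(float *y, const float *x, float t, int n) {
        for (int i = 0; i < n; ++i) {
            if (x[i] < t)
                continue;                    // skip this element
            y[i] = 2.0f * x[i];
        }
    }

    // If-converted form: both outcomes become data, so each iteration can
    // use a conditional select instead of a branch, enabling vectorization.
    void scale_ifconverted(float *y, const float *x, float t, int n) {
        for (int i = 0; i < n; ++i) {
            float newv = 2.0f * x[i];        // computed unconditionally
            y[i] = (x[i] < t) ? y[i] : newv; // lowers to csel/bsl, no branch
        }
    }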

In this HACCmk kernel, the compiler deemed if-conversion not beneficial, likely because the computation is expensive and involves multiple variables, each requiring its own conditional select. Branching around the force computation when it is not required was judged the faster option. As a result, the loop is not vectorizable using Neon. We can confirm this with the -fopt-info-vec-missed flag of gcc, which prints information about vectorization attempts that failed. In this case, it gives the following reason:

<source>:21:23: missed: couldn't vectorize loop

<source>:21:23: missed: not vectorized: control flow in loop.
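This diagnostic can be reproduced with an invocation along these lines (the source file name is assumed):

  gcc -Ofast -march=armv8.6-a+simd -fopt-info-vec-missed -c GravityForceKernel.cc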

This code is a good example where predication in SVE can increase vectorization opportunities. Predication allows the conditional statement to be handled within a vector on a per-element basis. In other words, with SVE a predicate vector can be calculated that specifies which vector elements are updated with new force calculations and which are left unchanged.
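To make the idea concrete, the following is a hedged sketch using SVE ACLE intrinsics (arm_sve.h, compiled with an SVE-enabled -march). The function and array names are illustrative, not from HACCmk, and the polynomial evaluation is omitted; it shows only the per-lane predicated accumulation.

    #include <arm_sve.h>

    // Sum f[i] over lanes where neither pruning condition holds.
    float sum_kept(const float *r2v, const float *fv, float MaxSepSqrd, int n) {
        svfloat32_t acc = svdup_n_f32(0.0f);
        for (int i = 0; i < n; i += (int)svcntw()) {
            svbool_t    pg     = svwhilelt_b32_s32(i, n);           // lanes still in bounds
            svfloat32_t r2     = svld1_f32(pg, r2v + i);
            svfloat32_t f      = svld1_f32(pg, fv + i);
            svbool_t    toofar = svcmpge_n_f32(pg, r2, MaxSepSqrd); // r2 >= MaxSepSqrd
            svbool_t    self   = svcmpeq_n_f32(pg, r2, 0.0f);       // r2 == 0.0f
            svbool_t    keep   = svnor_b_z(pg, toofar, self);       // neither => keep lane
            acc = svadd_f32_m(keep, acc, f);                        // inactive lanes unchanged
        }
        return svaddv_f32(svptrue_b32(), acc);                      // horizontal reduction
    }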

The following code was generated by gcc 12.1 when compiling for Neon and for SVE. The Neon compilation was unable to vectorize the loop and produced scalar code.

Code compiled for Neon (-Ofast -march=armv8.6-a+simd): https://godbolt.org/z/98Tv5aEbo
.L4:
  ldr     s16, [x2, x0]
  ldr     s17, [x1, x0]
  ldr     s7, [x3, x0]
  fsub    s16, s16, s1
  fsub    s17, s17, s0
  fsub    s7, s7, s2
  fmul    s5, s16, s16
  fmadd   s5, s17, s17, s5
  fmadd   s5, s7, s7, s5
  fmadd   s18, s5, s22, s21
  fcmp    s5, #0.0
  fadd    s6, s5, s4
  fccmpe  s5, s3, 0, ne
  fmadd   s18, s5, s18, s20
  bge     .L3
  fsqrt   s29, s6
  fmadd   s18, s5, s18, s26
  ldr     s28, [x4, x0]
  fmsub   s5, s5, s18, s25
  fmul    s6, s6, s29
  fdiv    s6, s27, s6
  fadd    s5, s6, s5
  fmul    s5, s5, s28
  fmadd   s24, s17, s5, s24
  fmadd   s23, s16, s5, s23
  fmadd   s19, s7, s5, s19
.L3:
  add     x0, x0, 4
  cmp     x8, x0
  bne     .L4
 
Code compiled for SVE (-Ofast -march=armv8.6-a+sve): https://godbolt.org/z/hKrsG8j8b

.L3:
  ld1w    z3.s, p1/z, [x2, x8, lsl 2]
  ld1w    z4.s, p1/z, [x1, x8, lsl 2]
  ld1w    z2.s, p1/z, [x3, x8, lsl 2]
  fsub    z3.s, z3.s, z20.s
  fsub    z4.s, z4.s, z21.s
  fmul    z0.s, z3.s, z3.s
  fsub    z2.s, z2.s, z19.s
  fmla    z0.s, p0/m, z4.s, z4.s
  fmla    z0.s, p0/m, z2.s, z2.s
  fadd    z1.s, z0.s, z17.s
  fcmeq   p3.s, p0/z, z0.s, #0.0
  movprfx z28, z1
  fsqrt   z28.s, p0/m, z1.s
  fmul    z1.s, z1.s, z28.s
  fdivr   z1.s, p0/m, z1.s, z22.s
  fadd    z1.s, z1.s, z23.s
  movprfx z5, z26
  fmla    z5.s, p0/m, z0.s, z27.s
  fcmge   p2.s, p0/z, z0.s, z18.s
  fmad    z5.s, p0/m, z0.s, z25.s
  nor     p2.b, p0/z, p2.b, p3.b
  fmad    z5.s, p0/m, z0.s, z24.s
  and     p3.b, p1/z, p2.b, p2.b
  fmsb    z0.s, p0/m, z5.s, z1.s
  ld1w    z1.s, p3/z, [x4, x8, lsl 2]
  add     x8, x8, x9
  fmul    z0.s, z0.s, z1.s
  movprfx z4.s, p2/z, z4.s
  fmul    z4.s, p2/m, z4.s, z0.s
  movprfx z3.s, p2/z, z3.s
  fmul    z3.s, p2/m, z3.s, z0.s
  fadd    z16.s, p1/m, z16.s, z4.s
  fadd    z7.s, p1/m, z7.s, z3.s
  movprfx z2.s, p2/z, z2.s
  fmul    z2.s, p2/m, z2.s, z0.s
  fadd    z6.s, p1/m, z6.s, z2.s
  whilelo p1.s, w8, w0
  b.any   .L3
  faddv   s6, p0, z6.s
  faddv   s7, p0, z7.s
  faddv   s16, p0, z16.s

The following paragraphs describe the instructions associated with the conditional statement in each listing.

In the scalar code, the compare with 0.0 (fcmp) and the later conditional compare (fccmpe) test the two conditions that execute the continue statement and skip the rest of the loop body (bge .L3) if either is true.

For the SVE case, the predicate registers p2 and p3 handle the two conditions for skipping the force computation, one for each condition. These results are then NORed together into p2, which controls which vector elements have the force calculation performed and which are left unmodified.

Getting Neon to vectorize

While the Neon compilation failed to vectorize this loop because of the control flow inside it, control flow does not always prevent vectorization. In this code, the continue statement functions as a goto back to the top of the loop. Sometimes the compiler can use if-conversion to change a control dependency into a data dependency and then vectorize the loop.

Sometimes, if-conversion changes a compare and branch sequence into a conditional select of two values based on the original condition. In other cases, a compare and branch sequence is replaced by masking operations that either modify a variable or leave it unchanged.

For this code, if-conversion entails doing the force calculation in every loop iteration and then using a mask to add either the calculated value or zero to lax, lay, and laz at the bottom of the loop, as the sketch below illustrates.
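A hedged sketch of the masking step (the helper name is ours): it selects f or 0.0f with a bitmask, which is the data-dependency form the if-converted loop relies on. std::memcpy sidesteps the strict-aliasing concerns of the pointer casts used below.

    #include <cstring>

    inline float mask_float(float f, bool keep) {
        unsigned int bits;
        unsigned int mask = keep ? ~0u : 0u;  // all-ones keeps f, zero yields +0.0f
        std::memcpy(&bits, &f, sizeof bits);  // reinterpret f's bits as an integer
        bits &= mask;
        std::memcpy(&f, &bits, sizeof f);     // reinterpret the masked bits as a float
        return f;
    }

With such a helper, the loop body could end with lax += mask_float(f, !(r2 >= MaxSepSqrd || r2 == 0.0f)) * dx, and likewise for lay and laz.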

Such a rewrite results in performing some floating-point calculations that would not have been done in the original code. The compiler has no way to know if these additional floating-point operations might cause exceptions that would not have occurred in the original code. In gcc, such optimizations are only done if -fno-trapping-math is used, which is included in -Ofast for gcc. So under -Ofast, gcc is allowed to do such a rewrite but did not, either because it thought it unprofitable or failed to see the opportunity.

However, rewriting the loop to explicitly do if-conversion by hand in the source code can coax the compiler to vectorize it with Advanced SIMD.

    for (int i = 0; i < n; ++i) {
        float dx = x[i] - x0, dy = y[i] - y0, dz = z[i] - z0;
        float r2 = dx * dx + dy * dy + dz * dz;
        unsigned int mask, tmp1;
 
        // remove if statement and continue statement
        float r2s = r2 + SofteningLenSqrd;
        float f = PolyCoefficients[PolyOrder];
        for (int p = 1; p <= PolyOrder; ++p)
            f = PolyCoefficients[PolyOrder-p] + r2*f;
 
        f = (1.0f / (r2s * std::sqrt(r2s)) - f) * mass[i];
 
        // calculate mask from the condition: all-ones keeps f, zero discards it
        mask = (r2 >= MaxSepSqrd || r2 == 0.0f) ? 0u : ~0u;
        // tmp1 assigned either f or 0 based on mask
        tmp1 = (*(unsigned int *)&f) & mask; // make f look like int
        lax += (*(float *) &tmp1) * dx; // make tmp1 look like float
        lay += (*(float *) &tmp1) * dy; // make tmp1 look like float
        laz += (*(float *) &tmp1) * dz; // make tmp1 look like float
    }
 
  ax += lax;
  ay += lay;
  az += laz;

Here the if and continue statements are gone, and the values added to lax, lay, and laz will be either 0 or the calculated force value, based on the same condition previously used for the continue. The value added is determined by the mask variable.

This results in the following Neon vector code being generated.

Code compiled for Neon after rewrite (-Ofast -march=armv8.6-a+simd): https://godbolt.org/z/77Pb1d6b6
        movi    v22.4s, 0x1
.L4:
        ldr     q18, [x2, x8]
        ldr     q17, [x1, x8]
        fsub    v18.4s, v18.4s, v8.4s
        ldr     q16, [x3, x8]
        fsub    v17.4s, v17.4s, v9.4s
        fmul    v5.4s, v18.4s, v18.4s
        fsub    v16.4s, v16.4s, v31.4s
        mov     v14.16b, v27.16b
        mov     v13.16b, v26.16b
        fmla    v5.4s, v17.4s, v17.4s
        ldr     q10, [x4, x8]
        add     x8, x8, 16
        fmla    v5.4s, v16.4s, v16.4s
        fadd    v12.4s, v30.4s, v5.4s
        fcmeq   v11.4s, v5.4s, 0
        fmla    v14.4s, v5.4s, v28.4s
        fcmge   v7.4s, v5.4s, v29.4s
        fsqrt   v6.4s, v12.4s
        fmla    v13.4s, v14.4s, v5.4s
        orr     v7.16b, v7.16b, v11.16b
        mov     v11.16b, v25.16b
        and     v7.16b, v22.16b, v7.16b
        fmla    v11.4s, v13.4s, v5.4s
        fmul    v6.4s, v6.4s, v12.4s
        fdiv    v6.4s, v24.4s, v6.4s
        fadd    v6.4s, v6.4s, v23.4s
        fmls    v6.4s, v11.4s, v5.4s
        fmul    v5.4s, v10.4s, v6.4s
        and     v5.16b, v7.16b, v5.16b
        fmla    v20.4s, v18.4s, v5.4s
        fmla    v21.4s, v17.4s, v5.4s
        fmla    v19.4s, v16.4s, v5.4s
        cmp     x8, x9
        bne     .L4

The force calculation is always done (even for far objects and for an object with itself). In other words, the computation is no longer pruned to approximate and speed up the solution. However, the calculated value is effectively discarded (replaced by zero) when the pruning conditions are met. This is accomplished by the two compares (fcmge and fcmeq) followed by the orr and two and instructions; together these zero out any vector elements where the pruning conditions were met. The last three fmla instructions then update lax, lay, and laz.

Performance comparison of these three executables

The performance of these three versions was evaluated using an internal cycle accurate simulator. The CPU core modeled was a Neoverse V1 and the same input dataset was used in all runs. The Neoverse V1[4] core has two 256-bit wide SVE execution units (2x256) and four 128-bit wide Advanced SIMD execution units (4x128).

For Neoverse V1, one would need twice as many Neon instructions as SVE instructions to perform a fixed amount of vectorized work. The total vector bandwidth (512 vector bits / cycle) is the same in Neoverse V1 for SVE and Neon.

The following table shows the execution times in cycles and other statistics for each of HACCmk binaries:

|                                        | Original (scalar) | Vectorized Neon | Vectorized SVE |
| CPU_CYCLES                             | 31,678,652        | 9,775,068       | 8,856,978      |
| INST_RETIRED                           | 74,230,037        | 23,740,887      | 13,390,886     |
| FLOPs[5] / iteration (hot loop)        | 28 (27.33 avg)    | 112             | 224            |
| # bytes loaded / iteration (hot loop)  | 16                | 64              | 128            |
| # iterations (hot loop)                | 2,477,008         | 618,439         | 310,591        |
| IPC                                    | 2.343             | 2.429           | 1.512          |

The Neoverse V1 can accomplish the same amount of floating-point work using two 256-bit SVE instructions per cycle or four 128-bit Neon instructions per cycle. This is a case where IPC comparisons can mislead because each SVE instruction does the work of two Neon instructions.
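As a rough, hedged normalization (optimistically treating every retired instruction as a full-width vector operation): Neon sustains about 2.429 × 128 ≈ 311 vector bits per cycle, while SVE sustains about 1.512 × 256 ≈ 387 vector bits per cycle. On that basis the SVE binary is doing more work per cycle despite its lower IPC.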

The simulator used can also provide execution counts per instruction address, which gives the number of iterations that the hot loop in each binary executed. The floating-point operations (FLOPs) per iteration are calculated by examining the disassembly. The original scalar code has 28 FLOPs in the hot loop under static analysis, but because part of the loop is sometimes pruned (4.5% of iterations for this input dataset), the dynamic FLOPs per iteration work out to 27.33. Multiplying the FLOPs per iteration by the number of iterations shows that each binary is doing the same total amount of FP work[6].
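A consistency check (the instruction counting here is ours, from the scalar listing above): a pruned iteration executes only the 13 FLOPs that precede the bge branch, so the expected average is 0.955 × 28 + 0.045 × 13 ≈ 27.3 FLOPs per iteration, matching the measured 27.33.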

Vectorizing the original scalar code to use Neon reduced the number of instructions executed by 65%[7]. This is despite the extra instructions executed because the vector Neon version no longer prunes the computation for very far objects or for objects with themselves. Doing some wasted work and discarding the results was still beneficial: the Neon vector code reduced execution cycles by 63% compared to the original scalar code.

The SVE version retained the computation pruning of the algorithm (through predication) and ran a further 26% faster than the vectorized Neon version. While pruning the computation through predication likely has minimal impact on the number of instructions executed, using SVE provided a slightly different mix of instructions, resulting in fewer and shorter data-dependency chains and improved instruction flow.

Having detailed cycle-by-cycle simulation output allows comparison of the fraction of execution cycles spent in the hot loop of each executable. The simulator provides the number of times each instruction was executed and how many cycles it waited to retire after becoming the oldest instruction (in program order) in the machine. The following stats are based on these counts.

 
|                 |           SVE           |    Original (scalar)    |     Vectorized Neon     |
|                 | Insts      | Cycles     | Insts      | Cycles     | Insts      | Cycles     |
| hot loop        | 12,113,049 | 8,390,976  | 72,973,140 | 31,236,139 | 24,737,560 | 11,245,318 |
| entire run      | 13,390,886 | 8,856,971  | 74,230,037 | 31,678,470 | 26,182,871 | 11,734,913 |
| Loop % of total | 90.5%      | 94.7%      | 98.3%      | 98.6%      | 94.5%      | 95.8%      |

This shows that approximately 95% or more of the total cycles executed were spent in the hot loop in all cases.

In the case of the original scalar code, the force computation is pruned when the objects are far apart. The detailed simulator output showed that 4.5% of the loop iterations were pruned (that is, 95.5% of the iterations did the force computation), and the cycles spent doing the force computation comprised 93% of all loop cycles. For the input data used, there was not enough pruning in the original code to outweigh the gains from vectorizing with Neon, despite some wasted computation.

Comparing the two vectorized versions (Neon and SVE) reveals the likely reason SVE outperformed Neon. Categorizing the instructions in the two hot loops shows:

| SVE                                 | Count | Neon                          | Count |
| LD1W                                | 4     | LDR                           | 4     |
| FSUB, FADD, FMUL, FMSB, FMLA, FMAD  | 21    | FSUB, FADD, FMUL, FMLS, FMLA  | 21    |
| FSQRT, FDIV                         | 2     | FSQRT, FDIV                   | 2     |
| FCMEQ, FCMGT                        | 2     | FCMEQ, FCMGT                  | 2     |
| MOV                                 | 1     | MOV                           | 4     |
| MOVPRFX                             | 5     | —                             | —     |
| BIC, AND, NOR                       | 1     | BIC, AND, NOR                 | 4     |
| ADD                                 | 1     | ADD                           | 1     |
| WHILELO, BCC                        | 2     | CMP, BCC                      | 2     |
| total                               | 39    | total                         | 40    |

The instruction mixes are almost the same except for differences in MOV, MOVPRFX, and the logical instructions BIC, AND, NOR. The SVE code uses a NOR to set certain predicate register bits while the Neon code uses BIC and three ANDs to mask off the vector elements that should not be modified.

In the Neon version, MOVs make copies of registers that must be preserved across iterations. For SVE, MOVPRFX provides this functionality: it tells the hardware that it, together with the immediately following destructive operation (like FMLA), can be converted into a single constructive operation (like FMADD). It is only a hint; the hardware can choose to treat it as a MOV or to fuse the pair and emit micro-ops for a constructive operation. Fusing preserves the source of the MOV without the cost of an explicit MOV, and this conversion would typically be done in the machine front end during micro-op generation.
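For example, this pair from the SVE listing above behaves as a single non-destructive multiply-accumulate (comments ours):

  movprfx z5, z26                  // hint: z5 begins as a copy of z26 ...
  fmla    z5.s, p0/m, z0.s, z27.s  // ... so the fused pair computes z5 = z26 + z0*z27 in the active lanes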

The extra logical instructions for Neon (BIC and AND), along with the MOVs, add instruction count and pressure to the machine compared to this SVE implementation. For SVE, MOVPRFX hints and per-element predication allow fewer instructions. Combined, these features can eliminate a cycle or two from each loop iteration, which adds up in such a hot loop.

Summary

HACCmk illustrates how SVE can vectorize loops that typically do not vectorize, or are very difficult to vectorize, without per-lane predication. In the example shown here, the hot loop of HACCmk was vectorized using Neon only after a hand rewrite informed by knowledge of why vectorization failed and of how to coax the compiler into vectorizing. With SVE, programmers who are not familiar with compiler internals get the benefit of vectorization in more cases, without needing such expert rewrites.

Download SVE Whitepaper

[1] https://alastairreid.github.io/papers/sve-ieee-micro-2017.pdf
[2] https://developer.arm.com/solutions/hpc/resources/hpc-white-papers/a-sneak-peek-into-sve-and-vla-programming
[3] https://developer.arm.com/solutions/hpc/resources/hpc-white-papers/arm-scalable-vector-extensions-and-application-to-machine-learning
[4] An overview of Neoverse V1 can be found at https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/neoverse-v1-platform-a-new-performance-tier-for-arm
[5] FP add, subtract, or multiply count as 1 FLOP while FP multiply-add, multiply-subtract, or multiply-accumulate count as 2 FLOPs.
[6] There can be small differences in total FLOP count due to loop prologue and epilogue considerations.
[7] Recall, 1 Neon instruction can do the work of 4 scalar instructions.
Comments

Karol Piatek, over 2 years ago:

Brian, brilliant article.

I am trying some of these proposals on Fedora 36 on AWS Graviton 3 for Cloudflare's/AWS zlib.

One small thing: the gcc option is named -fopt-info-missed-vec; the order of the parameters is different. :)

    Thanks for sharing,
    Karol
