ARM NEON optimization

1 Introduction

After reading the article ARM NEON programming quick reference, you should have a basic understanding of ARM NEON programming. But when applying ARM NEON to a real-world application, there are many programming skills to observe. This article introduces some common NEON optimization skills that come from development practice. The trade-offs between NEON assembly and intrinsics are also discussed.

2 NEON optimization skills

The following optimization skills are commonly used when optimizing applications with NEON.

2.1 Remove data dependencies

On the ARMv7-A platform, NEON instructions usually take more cycles than ARM instructions. To hide this instruction latency, avoid using the destination register of the current instruction as a source register of the next instruction.

The example:

C code

float SumSquareError_C(const float* src_a, const float* src_b, int count)

{

  float sse = 0.0f;

  int i;

  for (i = 0; i < count; ++i) {

    float diff = src_a[i] - src_b[i];

    sse += (float)(diff * diff);

  }

  return sse;

}

NEON implementation 1 (SumSquareError_NEON1), followed by NEON implementation 2 (SumSquareError_NEON2):

float SumSquareError_NEON1(const float* src_a, const float* src_b, int count)

{

  float sse;

  asm volatile (

    // Clear q8, q9, q10, q11

    "veor    q8, q8, q8                            \n"

    "veor    q9, q9, q9                            \n"

    "veor    q10, q10, q10                     \n"

    "veor    q11, q11, q11                     \n"

  "1:                                                           \n"

    "vld1.32     {q0, q1}, [%[src_a]]!       \n"

    "vld1.32     {q2, q3}, [%[src_a]]!       \n"

    "vld1.32     {q12, q13}, [%[src_b]]!  \n"

    "vld1.32     {q14, q15}, [%[src_b]]!  \n"

"subs %[count], %[count], #16  \n"

// q0, q1, q2, q3 are the destination of vsub.

// they are also the source of vmla.

    "vsub.f32 q0, q0, q12                      \n"

    "vmla.f32   q8, q0, q0                        \n"

    "vsub.f32   q1, q1, q13                      \n"

    "vmla.f32   q9, q1, q1                       \n"

    "vsub.f32   q2, q2, q14                    \n"

    "vmla.f32   q10, q2, q2                    \n"

    "vsub.f32   q3, q3, q15                    \n"

    "vmla.f32   q11, q3, q3                    \n"

    "bgt        1b                                        \n"

    "vadd.f32   q8, q8, q9                      \n"

    "vadd.f32   q10, q10, q11               \n"

    "vadd.f32   q11, q8, q10                 \n"

    "vpadd.f32  d2, d22, d23                \n"

    "vpadd.f32  d0, d2, d2                     \n"

    "vmov.32    %3, d0[0]                      \n"

    : [src_a] "+r"(src_a),

      [src_b] "+r"(src_b),

      [count] "+r"(count),

      "=r"(sse)

    :

    : "memory", "cc", "q0", "q1", "q2", "q3", "q8", "q9", "q10", "q11",

      "q12", "q13","q14", "q15");

  return sse;

}

float SumSquareError_NEON2(const float* src_a, const float* src_b, int count)

{

  float sse;

  asm volatile (

    // Clear q8, q9, q10, q11

    "veor    q8, q8, q8                            \n"

    "veor    q9, q9, q9                            \n"

    "veor    q10, q10, q10                     \n"

    "veor    q11, q11, q11                     \n"

  "1: \n"

    "vld1.32     {q0, q1}, [%[src_a]]!       \n"

    "vld1.32     {q2, q3}, [%[src_a]]!       \n"

    "vld1.32     {q12, q13}, [%[src_b]]!  \n"

    "vld1.32     {q14, q15}, [%[src_b]]!  \n"

    "subs       %[count], %[count], #16  \n"

    "vsub.f32 q0, q0, q12                      \n"

    "vsub.f32   q1, q1, q13                     \n"

    "vsub.f32   q2, q2, q14                     \n"

    "vsub.f32   q3, q3, q15                     \n"

    "vmla.f32   q8, q0, q0                      \n"

    "vmla.f32   q9, q1, q1                      \n"

    "vmla.f32   q10, q2, q2                    \n"

    "vmla.f32   q11, q3, q3                    \n"

    "bgt        1b                                         \n"

    "vadd.f32   q8, q8, q9                      \n"

    "vadd.f32   q10, q10, q11                \n"

    "vadd.f32   q11, q8, q10                  \n"

    "vpadd.f32  d2, d22, d23                 \n"

    "vpadd.f32  d0, d2, d2                      \n"

    "vmov.32    %3, d0[0]                       \n"

    : [src_a] "+r"(src_a),

      [src_b] "+r"(src_b),

      [count] "+r"(count),

      "=r"(sse)

    :

    : "memory", "cc", "q0", "q1", "q2", "q3", "q8", "q9", "q10", "q11",

      "q12", "q13","q14", "q15");

  return sse;

}

In NEON implementation 1, the destination register of each vsub is used as a source register by the very next instruction. In NEON implementation 2, the instructions are rescheduled so that each result is given as much time as possible before it is used. The test result indicates that implementation 2 is ~30% faster than implementation 1. Thus, reducing data dependencies can improve performance significantly. The good news is that the compiler can schedule NEON intrinsics automatically to avoid such data dependencies, which is one of the big advantages of intrinsics.

Note: this test was run on Cortex-A9. The result may differ on other platforms.
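The same idea can be sketched with intrinsics. The block below is a minimal illustration, not code from the article: SumSquareError_Intrinsics is a hypothetical name, it assumes count is a positive multiple of 16 like the assembly versions, and it falls back to plain C when NEON is not available. It keeps four independent accumulators so that no multiply-accumulate has to wait for the one before it; how the instructions are finally ordered is left to the compiler.

```c
#include <assert.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Sum of squared differences of two float arrays.
 * Assumption: count is a positive multiple of 16. */
float SumSquareError_Intrinsics(const float* src_a, const float* src_b, int count)
{
  assert(count > 0 && count % 16 == 0);
#if defined(__ARM_NEON)
  /* Four independent accumulators, in the role of q8..q11 in the assembly. */
  float32x4_t acc0 = vdupq_n_f32(0.0f);
  float32x4_t acc1 = vdupq_n_f32(0.0f);
  float32x4_t acc2 = vdupq_n_f32(0.0f);
  float32x4_t acc3 = vdupq_n_f32(0.0f);
  for (int i = 0; i < count; i += 16) {
    /* All subtractions are written before any multiply-accumulate. */
    float32x4_t d0 = vsubq_f32(vld1q_f32(src_a + i),      vld1q_f32(src_b + i));
    float32x4_t d1 = vsubq_f32(vld1q_f32(src_a + i + 4),  vld1q_f32(src_b + i + 4));
    float32x4_t d2 = vsubq_f32(vld1q_f32(src_a + i + 8),  vld1q_f32(src_b + i + 8));
    float32x4_t d3 = vsubq_f32(vld1q_f32(src_a + i + 12), vld1q_f32(src_b + i + 12));
    acc0 = vmlaq_f32(acc0, d0, d0);
    acc1 = vmlaq_f32(acc1, d1, d1);
    acc2 = vmlaq_f32(acc2, d2, d2);
    acc3 = vmlaq_f32(acc3, d3, d3);
  }
  /* Reduce the four vectors to one scalar. */
  float32x4_t sum = vaddq_f32(vaddq_f32(acc0, acc1), vaddq_f32(acc2, acc3));
  float32x2_t pair = vadd_f32(vget_low_f32(sum), vget_high_f32(sum));
  pair = vpadd_f32(pair, pair);
  return vget_lane_f32(pair, 0);
#else
  /* Portable fallback that keeps four partial sums. */
  float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
  for (int i = 0; i < count; i += 4) {
    for (int k = 0; k < 4; ++k) {
      float diff = src_a[i + k] - src_b[i + k];
      acc[k] += diff * diff;
    }
  }
  return (acc[0] + acc[1]) + (acc[2] + acc[3]);
#endif
}
```

Because the accumulation order differs from the scalar C version, results may differ slightly in the last bits for general floating-point input.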

2.2 Reduce branches

There is no branch instruction in the NEON instruction set. When a branch is needed, the ARM branch instructions are used. ARM processors make wide use of branch prediction, but when a prediction fails, the penalty is rather high. So it’s better to avoid branch instructions where possible. In fact, logical operations can replace branches in some cases.

The example:

C implementation, followed by the NEON implementation:

if( flag )

{

        dst[x * 4]       = a;

        dst[x * 4 + 1] = a;

        dst[x * 4 + 2] = a;

        dst[x * 4 + 3] = a;

}

else

{

        dst[x * 4]       = b;

        dst[x * 4 + 1] = b;

        dst[x * 4 + 2] = b;

        dst[x * 4 + 3] = b;

}

//dst[x * 4]       = (a&Eflag) | (b&~Eflag);

//dst[x * 4 + 1] = (a&Eflag) | (b&~Eflag);

//dst[x * 4 + 2] = (a&Eflag) | (b&~Eflag);

//dst[x * 4 + 3] = (a&Eflag) | (b&~Eflag);

VBSL qFlag, qA, qB

The ARM NEON instruction set provides the following instructions to help implement the logical operation above:

  • VCEQ, VCGE, VCGT, VCLE, VCLT, etc.
  • VBIT, VBIF, VBSL, etc.

Reducing branches is not specific to NEON. It is a commonly used trick, and it is worth the effort even in a plain C program.
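As a rough sketch of the mask-and-select pattern above, the helper below (select_store4 is a hypothetical name, and the uint32_t element type is an assumption) builds an all-ones or all-zeros mask from flag and selects between a and b without a branch. When NEON is available it uses vbslq_u32, which corresponds to VBSL; otherwise it uses the same logic in plain C.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Stores four copies of a (flag != 0) or b (flag == 0) to
 * dst[x * 4] .. dst[x * 4 + 3] without branching on flag. */
void select_store4(uint32_t* dst, int x, uint32_t a, uint32_t b, int flag)
{
  assert(dst != 0 && x >= 0);
  /* All ones when flag is nonzero, all zeros otherwise. This plays the
   * role of the mask that a compare such as VCEQ or VCGT would produce. */
  uint32_t mask = (uint32_t)0 - (uint32_t)(flag != 0);
#if defined(__ARM_NEON)
  /* Bitwise select: take bits of a where mask is 1, bits of b elsewhere. */
  uint32x4_t result = vbslq_u32(vdupq_n_u32(mask), vdupq_n_u32(a), vdupq_n_u32(b));
  vst1q_u32(dst + x * 4, result);
#else
  uint32_t value = (a & mask) | (b & ~mask);
  for (int k = 0; k < 4; ++k)
    dst[x * 4 + k] = value;
#endif
}
```

In real code the mask usually comes from a vector comparison, so each lane can select independently; the scalar flag here only keeps the sketch short.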

2.3 Preload data: PLD [i]

ARM processors use a load/store architecture: apart from load/store instructions, all operations are performed on registers. Therefore, increasing the efficiency of load/store instructions is very important for optimizing an application.

The preload instruction allows the processor to signal to the memory system that a data load from an address is likely in the near future. If the data is preloaded into the cache correctly, the cache hit rate improves, which can boost performance significantly. But preload is not a panacea. It is very hard to use well on recent processors, and it can be harmful too: a bad preload will reduce performance.

PLD syntax:

    PLD{cond} [Rn {, #offset}]

    PLD{cond} [Rn, +/-Rm {, shift}]

    PLD{cond} label

Where:

cond - is an optional condition code.

Rn - is the register on which the memory address is based.

offset - is an immediate offset. If offset is omitted, the address is the value in Rn.

Rm - contains an offset value and must not be PC (or SP, in Thumb state).

shift - is an optional shift.

label - is a PC-relative expression.

Features of the PLD operation:

  • It is independent of load and store instruction execution.
  • It happens in the background while the processor continues to execute other instructions.
  • The offset has to be chosen for each real case.
2.4 Misc

In ARM NEON programming, different instruction sequences can be used to perform the same operation. But fewer instructions do not always produce better performance. It depends on the benchmark and profiling results of the specific case. Some special cases from development practice are listed below.

2.4.1 Floating-point VMLA/VMLS instruction

This example is specific to Cortex-A9. For other platforms, the result needs to be verified again.

Usually, VMUL+VADD/VMUL+VSUB can be replaced by VMLA/VMLS because fewer instructions are used. But compared to floating-point VMUL, floating-point VMLA/VMLS has a longer instruction latency. If there are no other instructions that can be inserted into the delay slot, using floating-point VMUL+VADD/VMUL+VSUB will show better performance.

A real-world example is the floating-point FIR function in Ne10. The code snippets are as follows:

Implementation 1: there is only one instruction, VEXT, between two VMLA instructions, and each VMLA needs 9 execution cycles according to the table of NEON floating-point instruction timing.

Implementation 2: there is still a data dependency on qAcc0, but VADD/VMUL needs only 5 execution cycles.

The benchmark shows that implementation 2 has better performance.

Implementation 1: VMLA

VEXT qTemp1,qInp,qTemp,#1

VMLA qAcc0,qInp,dCoeff_0[0]

VEXT qTemp2,qInp,qTemp,#2

VMLA qAcc0,qTemp1,dCoeff_0[1]

VEXT qTemp3,qInp,qTemp,#3

VMLA qAcc0,qTemp2,dCoeff_1[0]

VMLA qAcc0,qTemp3,dCoeff_1[1]

Implementation 2: VMUL+VADD

VEXT qTemp1,qInp,qTemp,#1

VMLA qAcc0,qInp,dCoeff_0[0]

VMUL qAcc1,qTemp1,dCoeff_0[1]

VEXT qTemp2,qInp,qTemp,#2

VMUL qAcc2,qTemp2,dCoeff_1[0]

VADD qAcc0, qAcc0, qAcc1

VEXT qTemp3,qInp,qTemp,#3

VMUL qAcc3,qTemp3,dCoeff_1[1]

VADD qAcc0, qAcc0, qAcc2

VADD qAcc0, qAcc0, qAcc3

For more code details, see line 195 of modules/dsp/NE10_fir.neon.s in:

https://github.com/projectNe10/Ne10/commit/97c162781c83584851ea3758203f9d2aa46772d5?diff=split

NEON floating-point instructions timing:

Name              Format      Cycles   Result
VADD, VSUB, VMUL  Dd,Dn,Dm    1        5
                  Qd,Qn,Dm    2        5
VMLA, VMLS        Dd,Dn,Dm    1        9
                  Qd,Qn,Dm    2        9

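The table is taken from reference [ii].

Where:

  • Cycles: the number of cycles the instruction takes to issue.
  • Result: the number of cycles until the instruction has executed and its result is available.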
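The two accumulation styles can also be written with intrinsics. The sketch below is a simplified illustration and not the Ne10 code: the function names are hypothetical, it uses unaligned loads at x + 1, x + 2 and x + 3 instead of VEXT, and it computes a single 4-tap output vector. Note that a compiler is free to reschedule or fuse these operations, so the generated assembly should be checked.

```c
#include <assert.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Both functions compute out[j] = sum over k of x[j + k] * c[k] for
 * j = 0..3, so x must hold at least 7 floats. */

/* Style 1: one accumulator; each multiply-accumulate waits on the last. */
void fir4_mla_chain(const float* x, const float c[4], float out[4])
{
  assert(x != 0 && c != 0 && out != 0);
#if defined(__ARM_NEON)
  float32x4_t acc = vmulq_n_f32(vld1q_f32(x), c[0]);
  acc = vmlaq_n_f32(acc, vld1q_f32(x + 1), c[1]);
  acc = vmlaq_n_f32(acc, vld1q_f32(x + 2), c[2]);
  acc = vmlaq_n_f32(acc, vld1q_f32(x + 3), c[3]);
  vst1q_f32(out, acc);
#else
  for (int j = 0; j < 4; ++j) {
    float acc = x[j] * c[0];
    acc += x[j + 1] * c[1];
    acc += x[j + 2] * c[2];
    acc += x[j + 3] * c[3];
    out[j] = acc;
  }
#endif
}

/* Style 2: independent products first, combined with adds afterwards. */
void fir4_mul_add(const float* x, const float c[4], float out[4])
{
  assert(x != 0 && c != 0 && out != 0);
#if defined(__ARM_NEON)
  float32x4_t p0 = vmulq_n_f32(vld1q_f32(x), c[0]);
  float32x4_t p1 = vmulq_n_f32(vld1q_f32(x + 1), c[1]);
  float32x4_t p2 = vmulq_n_f32(vld1q_f32(x + 2), c[2]);
  float32x4_t p3 = vmulq_n_f32(vld1q_f32(x + 3), c[3]);
  vst1q_f32(out, vaddq_f32(vaddq_f32(p0, p1), vaddq_f32(p2, p3)));
#else
  for (int j = 0; j < 4; ++j)
    out[j] = (x[j] * c[0] + x[j + 1] * c[1]) + (x[j + 2] * c[2] + x[j + 3] * c[3]);
#endif
}
```

Which style is faster depends on the micro-architecture and the compiler, so both should be benchmarked on the target.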

2.5 Summary

NEON optimization techniques are summarized as follows:

  • Utilize instruction delay slots as much as possible.
  • Avoid branches.
  • Pay attention to the cache hit rate.

3 NEON assembly and intrinsics

In “ARM NEON programming quick reference”, there is a simple comparison of the pros and cons of NEON assembly and intrinsics:

Performance

  • NEON assembly: Always shows the best performance for the specified platform for an experienced developer.
  • NEON intrinsics: Depends heavily on the toolchain that is used.

Portability

  • NEON assembly: The different ISAs (i.e. ARMv7-A/v8-A AArch32 and ARMv8-A AArch64) have different assembly implementations. Even for the same ISA, the assembly might need to be fine-tuned to achieve ideal performance on different micro-architectures.
  • NEON intrinsics: Program once and run on different ISAs. The compiler may also provide performance fine-tuning for different micro-architectures.

Maintainability

  • NEON assembly: Hard to read/write compared with C.
  • NEON intrinsics: Similar to C code, it’s easy to read/write.

But the reality is far more complex than this comparison suggests, especially when it comes to ARMv7-A/v8-A cross-platform development. The following sections analyze this issue further with some examples.

3.1 Programming

For NEON beginners, intrinsics are easier to use than assembly. But experienced developers may be more familiar with NEON assembly programming, and they need time to adapt to coding with intrinsics. Some issues that may occur in real development are described below.

3.1.1 Flexibility of instruction

In terms of instruction usage, assembly is more flexible than intrinsics. This is mainly reflected in data load/store.

Example:

Intrinsics

  • Load data into a 64-bit register: vld1_s8/u8/s16/u16/s32, etc.
  • Load data into a 128-bit register: vld1q_s8/u8/s16/u16/s32, etc.

ARMv7-A assembly

VLD1 { Dd }, [Rn]

VLD1 { Dd, D(d+1) }, [Rn]

VLD1 { Dd, D(d+1), D(d+2) }, [Rn]

VLD1 { Dd, D(d+1), D(d+2), D(d+3) }, [Rn]

ARMv8-A assembly

LD1 { <Vt>.<T> }, [<Xn|SP>]

LD1 { <Vt>.<T>, <Vt2>.<T> }, [<Xn|SP>]

LD1 { <Vt>.<T>, <Vt2>.<T>, <Vt3>.<T> }, [<Xn|SP>]

LD1 { <Vt>.<T>, <Vt2>.<T>, <Vt3>.<T>, <Vt4>.<T> }, [<Xn|SP>]

This issue will be fixed by compiler upgrades in the future. Sometimes the compiler is already able to translate two intrinsics into one assembly instruction.

Therefore, it’s expected that intrinsics will have the same flexibility as assembly instructions as the ARMv8 toolchain is upgraded.
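As an illustration of the kind of pattern where two intrinsics may be merged, the sketch below issues two 128-bit loads from adjacent addresses. This is an assumed example, not one taken from the article: add_halves is a hypothetical helper, and whether a given compiler emits a single two-register load for it depends on the compiler and its version, so the disassembly has to be checked.

```c
#include <assert.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Computes out[k] = p[k] + p[k + 4] for k = 0..3 (p holds 8 floats). */
void add_halves(const float* p, float out[4])
{
  assert(p != 0 && out != 0);
#if defined(__ARM_NEON)
  /* Two separate load intrinsics on adjacent memory. A compiler may
   * combine them into one multi-register load; this is not guaranteed. */
  float32x4_t lo = vld1q_f32(p);
  float32x4_t hi = vld1q_f32(p + 4);
  vst1q_f32(out, vaddq_f32(lo, hi));
#else
  /* Portable fallback with the same result. */
  for (int k = 0; k < 4; ++k)
    out[k] = p[k] + p[k + 4];
#endif
}
```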

3.1.2 Register allocation

When programming in NEON assembly, registers have to be allocated by the programmer, who must know exactly which registers are occupied. One of the benefits of programming with intrinsics is that you only need to define variables; the compiler allocates registers automatically. This is an advantage, but it can be a weakness in some cases. Practice has shown that using too many NEON registers simultaneously in intrinsics code causes register allocation problems with the gcc compiler. When this happens, a lot of data is spilled to the stack, which greatly affects the performance of the program. Therefore, pay attention to this issue when programming with intrinsics. When performance is unexpectedly poor (for example, the C version is faster than the NEON version), first check the disassembly for register allocation problems. ARMv8-A AArch64 has more registers (32 128-bit NEON registers), so the impact of this issue is significantly reduced.

3.2 Performance and compiler

On a given platform, the performance of NEON assembly is decided only by the implementation and has nothing to do with the compiler. The benefit is that you can predict and control the performance of the program when you hand-tune the code, but there are no surprises.

Conversely, the performance of NEON intrinsics depends greatly on the compiler used. Different compilers may deliver very different performance. In general, the older the compiler, the worse the performance. When compatibility with older compilers is needed, consider carefully whether intrinsics fit the need. In addition, when fine-tuning the code, it’s hard to predict the change in performance because of the compiler’s intervention. But there may be surprises: sometimes intrinsics deliver better performance than assembly. This is very rare, but it does occur.

The compiler has an impact on the process of NEON optimization. The following figure describes the general process of NEON implementation and optimization.

 

NEON assembly and intrinsics share the same implementation process: coding, debugging, and performance testing. But their optimization steps differ.

The methods of assembly fine-tuning are:

  • Change the implementation, for example by changing the instructions or adjusting the parallelism.
  • Adjust the instruction sequence to reduce data dependencies.
  • Try the skills described in section 2.

A more sophisticated way to fine-tune assembly is to:

  • Know exactly how many instructions are used.
  • Get the execution cycles of the program by using the PMU (Performance Monitoring Unit).
  • Adjust the sequence of instructions based on the timing of the instructions used, and minimize instruction delays as much as possible.

The disadvantage of this approach is that the changes are specific to one micro-architecture. When the platform is switched, the performance improvement achieved might be lost. It is also very time-consuming for what are often comparatively small gains.
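Reading PMU cycle counters is platform-specific, so the sketch below uses the standard C clock() function as a portable stand-in for the measurement step. It is only an assumed harness: time_kernel and count_calls are hypothetical names, and CPU time from clock() is far coarser than PMU cycle counts, so many iterations are needed for a stable figure.

```c
#include <assert.h>
#include <time.h>

/* Runs fn(arg) `iterations` times and returns the average CPU time
 * per call in seconds. */
double time_kernel(void (*fn)(void*), void* arg, int iterations)
{
  assert(fn != 0 && iterations > 0);
  clock_t start = clock();
  for (int i = 0; i < iterations; ++i)
    fn(arg);
  clock_t end = clock();
  return (double)(end - start) / CLOCKS_PER_SEC / iterations;
}

/* Trivial example kernel: counts how often it was called. */
void count_calls(void* arg)
{
  ++*(int*)arg;
}
```

The same harness can be pointed at an assembly version and an intrinsics version of a function to compare them under identical conditions.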

Fine-tuning of NEON intrinsics is more difficult.

  • Try the methods used in NEON assembly optimization.
  • Look at the generated assembly and check the data dependencies and register usage.
  • Check whether the performance meets expectations. If it does, the optimization work is done; the performance with other compilers then needs to be verified as well.

When porting ARMv7-A assembly code to intrinsics for ARMv7-A/v8-A compatibility, the performance of the assembly can be used as a reference, so it’s easy to check whether the work is done. However, when intrinsics are used to optimize ARMv8-A code, there is no performance reference, and it’s difficult to determine whether the performance is optimal. Based on the experience with ARMv7-A, there may be doubt about whether assembly would perform better. I think the impact of this issue will become smaller and smaller as the ARMv8-A environment matures.

3.3 Cross-platform and portability

Currently, most existing NEON assembly code can only run on ARMv7-A/ARMv8-A AArch32 platforms. If you want to run it on ARMv8-A AArch64 platforms, you must rewrite the code, which takes a lot of work. In that situation, code written with NEON intrinsics can run directly on ARMv8-A AArch64 platforms. Being cross-platform is one of the great advantages of intrinsics. Meanwhile, you only need to maintain one set of code for the different platforms, which also significantly reduces the maintenance effort.

However, due to the different hardware resources on ARMv7-A and ARMv8-A platforms, sometimes there may still be two sets of code even with intrinsics. The FFT implementation in the Ne10 project is an example:

// radix 4 butterfly with twiddles

scratch[0].r = scratch_in[0].r;

scratch[0].i = scratch_in[0].i;

scratch[1].r = scratch_in[1].r * scratch_tw[0].r - scratch_in[1].i * scratch_tw[0].i;

scratch[1].i = scratch_in[1].i * scratch_tw[0].r + scratch_in[1].r * scratch_tw[0].i;

scratch[2].r = scratch_in[2].r * scratch_tw[1].r - scratch_in[2].i * scratch_tw[1].i;

scratch[2].i = scratch_in[2].i * scratch_tw[1].r + scratch_in[2].r * scratch_tw[1].i;

scratch[3].r = scratch_in[3].r * scratch_tw[2].r - scratch_in[3].i * scratch_tw[2].i;

scratch[3].i = scratch_in[3].i * scratch_tw[2].r + scratch_in[3].r * scratch_tw[2].i;

The above code snippet lists the basic element of the FFT: the radix4 butterfly. From the code, the following can be concluded:

  • 20 64-bit NEON registers are needed if 2 radix4 butterflies are executed in one loop.
  • 20 128-bit NEON registers are needed if 4 radix4 butterflies are executed in one loop.

And, for ARMv7-A/v8-A AArch32 and v8-A AArch64:

  • There are 32 64-bit or 16 128-bit NEON registers for ARMv7-A/v8-A AArch32.
  • There are 32 128-bit NEON registers for ARMv8-A AArch64.

Considering the above factors, the FFT implementation of Ne10 eventually has an assembly version, in which 2 radix4 butterflies are executed in one loop, for ARMv7-A/v8-A AArch32, and an intrinsics version, in which 4 radix4 butterflies are executed in one loop, for ARMv8-A AArch64.

This example illustrates that you need to pay attention to some exceptions when maintaining one set of code across ARMv7-A/v8-A platforms.
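One way to keep such a difference visible in a single source file is to select the number of butterflies per loop at compile time. The block below is a plain C sketch under stated assumptions, not the Ne10 implementation: cpx_t, twiddle4, twiddle_all and BUTTERFLIES_PER_LOOP are hypothetical names, and the data layout (4 inputs and 3 twiddles per butterfly, stored contiguously) is assumed for illustration.

```c
#include <assert.h>

typedef struct { float r, i; } cpx_t;  /* hypothetical complex type */

/* Assumed policy: more butterflies per loop where more registers exist. */
#if defined(__aarch64__)
#define BUTTERFLIES_PER_LOOP 4
#else
#define BUTTERFLIES_PER_LOOP 2
#endif

/* Twiddle step of one radix4 butterfly, same arithmetic as the snippet. */
static void twiddle4(cpx_t out[4], const cpx_t in[4], const cpx_t tw[3])
{
  out[0] = in[0];
  for (int k = 1; k < 4; ++k) {
    out[k].r = in[k].r * tw[k - 1].r - in[k].i * tw[k - 1].i;
    out[k].i = in[k].i * tw[k - 1].r + in[k].r * tw[k - 1].i;
  }
}

/* Processes `count` butterflies, BUTTERFLIES_PER_LOOP per iteration. */
void twiddle_all(cpx_t* out, const cpx_t* in, const cpx_t* tw, int count)
{
  assert(count > 0 && count % BUTTERFLIES_PER_LOOP == 0);
  for (int b = 0; b < count; b += BUTTERFLIES_PER_LOOP) {
    for (int u = 0; u < BUTTERFLIES_PER_LOOP; ++u)
      twiddle4(out + 4 * (b + u), in + 4 * (b + u), tw + 3 * (b + u));
  }
}
```

The inner loop has a fixed trip count, so a compiler can unroll it; whether the unrolled body then fits in the available NEON registers still has to be confirmed in the generated assembly.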

3.4 Future

Many issues with using NEON assembly and intrinsics have been discussed. But these issues are temporary. In the long term, intrinsics will be the better choice. By using intrinsics, you can reap the benefits of hardware and compiler upgrades without reprogramming. That means some classical algorithms only need to be implemented once. The compiler will help to adjust the code for new hardware, which reduces the workload significantly. Pffft is an example.

The following figure describes the performance of pffft and Ne10 real FFT on the ARM Cortex-A9 platform with gcc. The X-axis represents the length of the FFT. The Y-axis represents the execution time; smaller is better. Pffft is implemented with NEON intrinsics. Ne10 real FFT is implemented with NEON assembly. They don’t use the same algorithm, but they have similar performance.

  

In ARMv8-A AArch64 mode, the Ne10 real FFT is rewritten with both NEON assembly and intrinsics. Section 3.3 explained that ARMv8-A can process 4 butterflies in parallel, but ARMv7-A can only process 2 butterflies in parallel. So theoretically, the effect of FFT optimization on ARMv8-A should be better than on ARMv7-A. However, the following figure shows clearly that pffft has the best performance. The result suggests that the compiler has done very good optimizations specific to the ARMv8 architecture.

  

From this example, it can be concluded that pffft, whose performance isn’t the best on ARMv7-A, shows very good performance in ARMv8-A AArch64 mode. That supports the point: the compiler can adjust intrinsics automatically for ARMv8-A to achieve good performance.

In the long run, existing NEON assembly code has to be rewritten for ARMv8. If NEON is upgraded in the future, the code has to be rewritten again and again. But NEON intrinsics code can be expected to show good performance on ARMv8-A with the help of compilers. Even if NEON is upgraded, you can also look forward to the upgrade of compilers.

3.5 Summary

In this section, the pros and cons of NEON assembly and intrinsics have been analyzed with some examples. The benefits of intrinsics far outweigh the drawbacks. Compared to assembly, intrinsics are easier to program and have better compatibility between ARMv7 and ARMv8.

Some tips for NEON intrinsics are summarized as follows:

  • Watch the number of registers used.
  • Keep in mind that performance depends on the compiler.
  • Do look at the generated assembly.

4 End

This blog mainly discusses some common NEON optimization skills and analyzes the pros and cons of NEON assembly and intrinsics with some examples. I hope these can help NEON developers in actual development.


[i] Cortex™-A Series Version: 2.0 Programmer’s Guide: 17-12 and 17-20

[ii] Cortex™-A9 NEON™ Media Processing Engine Revision: r4p1 Technical Reference Manual: 3.4.8
