Welcome to the ARM NEON optimization guide!
After reading the article "ARM NEON programming quick reference", you should have a basic understanding of ARM NEON programming. But when applying NEON to real-world applications, there are many programming skills to master. This article introduces some common NEON optimization techniques drawn from development practice, and also discusses the question of NEON assembly versus intrinsics.
When using NEON to optimize applications, the following commonly used techniques apply.
On the ARMv7-A platform, NEON instructions usually take more cycles than ARM instructions. To reduce instruction latency, it's better to avoid using the destination register of the current instruction as a source register of the next instruction.
C code:
float SumSquareError_C(const float* src_a, const float* src_b, int count) {
  float sse = 0.0f;
  int i;
  for (i = 0; i < count; ++i) {
    float diff = src_a[i] - src_b[i];
    sse += diff * diff;
  }
  return sse;
}
NEON implementation 1
float SumSquareError_NEON1(const float* src_a, const float* src_b, int count) {
  float sse;
  asm volatile (
    // Clear the accumulators q8-q11.
    "veor     q8, q8, q8               \n"
    "veor     q9, q9, q9               \n"
    "veor     q10, q10, q10            \n"
    "veor     q11, q11, q11            \n"
  "1:                                  \n"
    "vld1.32  {q0, q1}, [%[src_a]]!    \n"
    "vld1.32  {q2, q3}, [%[src_a]]!    \n"
    "vld1.32  {q12, q13}, [%[src_b]]!  \n"
    "vld1.32  {q14, q15}, [%[src_b]]!  \n"
    "subs     %[count], %[count], #16  \n"
    // q0-q3 are the destinations of vsub and, immediately after,
    // the sources of vmla: a tight data dependency.
    "vsub.f32 q0, q0, q12              \n"
    "vmla.f32 q8, q0, q0               \n"
    "vsub.f32 q1, q1, q13              \n"
    "vmla.f32 q9, q1, q1               \n"
    "vsub.f32 q2, q2, q14              \n"
    "vmla.f32 q10, q2, q2              \n"
    "vsub.f32 q3, q3, q15              \n"
    "vmla.f32 q11, q3, q3              \n"
    "bgt      1b                       \n"
    // Reduce the four accumulators to a single float.
    "vadd.f32 q8, q8, q9               \n"
    "vadd.f32 q10, q10, q11            \n"
    "vadd.f32 q11, q8, q10             \n"
    "vpadd.f32 d2, d22, d23            \n"
    "vpadd.f32 d0, d2, d2              \n"
    "vmov.32  %[sse], d0[0]            \n"
    : [src_a] "+r"(src_a), [src_b] "+r"(src_b), [count] "+r"(count),
      [sse] "=r"(sse)
    :
    : "memory", "cc", "q0", "q1", "q2", "q3", "q8", "q9", "q10", "q11",
      "q12", "q13", "q14", "q15");
  return sse;
}
NEON implementation 2
float SumSquareError_NEON2(const float* src_a, const float* src_b, int count) {
  float sse;
  asm volatile (
    // Clear the accumulators q8-q11.
    "veor     q8, q8, q8               \n"
    "veor     q9, q9, q9               \n"
    "veor     q10, q10, q10            \n"
    "veor     q11, q11, q11            \n"
  "1:                                  \n"
    "vld1.32  {q0, q1}, [%[src_a]]!    \n"
    "vld1.32  {q2, q3}, [%[src_a]]!    \n"
    "vld1.32  {q12, q13}, [%[src_b]]!  \n"
    "vld1.32  {q14, q15}, [%[src_b]]!  \n"
    "subs     %[count], %[count], #16  \n"
    // All four subtracts are issued before their results are consumed,
    // giving each vsub time to complete before the matching vmla.
    "vsub.f32 q0, q0, q12              \n"
    "vsub.f32 q1, q1, q13              \n"
    "vsub.f32 q2, q2, q14              \n"
    "vsub.f32 q3, q3, q15              \n"
    "vmla.f32 q8, q0, q0               \n"
    "vmla.f32 q9, q1, q1               \n"
    "vmla.f32 q10, q2, q2              \n"
    "vmla.f32 q11, q3, q3              \n"
    "bgt      1b                       \n"
    // Reduce the four accumulators to a single float.
    "vadd.f32 q8, q8, q9               \n"
    "vadd.f32 q10, q10, q11            \n"
    "vadd.f32 q11, q8, q10             \n"
    "vpadd.f32 d2, d22, d23            \n"
    "vpadd.f32 d0, d2, d2              \n"
    "vmov.32  %[sse], d0[0]            \n"
    : [src_a] "+r"(src_a), [src_b] "+r"(src_b), [count] "+r"(count),
      [sse] "=r"(sse)
    :
    : "memory", "cc", "q0", "q1", "q2", "q3", "q8", "q9", "q10", "q11",
      "q12", "q13", "q14", "q15");
  return sse;
}
In NEON implementation 1, the destination register of each vsub is immediately used as a source register of the following vmla. In NEON implementation 2, the instructions are rescheduled so that each result has as much time as possible to become available before it is consumed. The test result indicates that implementation 2 is ~30% faster than implementation 1. Thus, reducing data dependencies can improve performance significantly. The good news is that the compiler can reschedule NEON intrinsics automatically to avoid such data dependencies, which is one of their big advantages.
Note: this test was run on a Cortex-A9. The result may differ on other platforms.
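To illustrate, here is a minimal intrinsics sketch of the same kernel (our own code, not from the original article; the function name and the tail handling are ours). The compiler is free to interleave the subtracts and multiply-accumulates to hide latency, which hand-written assembly has to do manually:

#include <arm_neon.h>

float SumSquareError_NEON_intrinsics(const float* src_a, const float* src_b,
                                     int count) {
  // Two independent accumulators reduce data dependencies between iterations.
  float32x4_t acc0 = vdupq_n_f32(0.0f);
  float32x4_t acc1 = vdupq_n_f32(0.0f);
  int i;
  for (i = 0; i <= count - 8; i += 8) {
    float32x4_t d0 = vsubq_f32(vld1q_f32(src_a + i), vld1q_f32(src_b + i));
    float32x4_t d1 = vsubq_f32(vld1q_f32(src_a + i + 4), vld1q_f32(src_b + i + 4));
    acc0 = vmlaq_f32(acc0, d0, d0);  // acc0 += d0 * d0
    acc1 = vmlaq_f32(acc1, d1, d1);  // acc1 += d1 * d1
  }
  // Horizontal reduction of the two accumulators to a single float.
  float32x4_t acc = vaddq_f32(acc0, acc1);
  float32x2_t sum = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
  sum = vpadd_f32(sum, sum);
  float sse = vget_lane_f32(sum, 0);
  for (; i < count; ++i) {  // scalar tail for counts not divisible by 8
    float diff = src_a[i] - src_b[i];
    sse += diff * diff;
  }
  return sse;
}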
There is no branch instruction in the NEON instruction set; when a branch is needed, the ARM jump instructions are used. ARM processors make heavy use of branch prediction, but when a prediction fails the penalty is rather high. So it's better to avoid jump instructions where possible. In fact, logical operations can replace branches in some cases.
Example:
C implementation
if (flag) {
  dst[x * 4]     = a;
  dst[x * 4 + 1] = a;
  dst[x * 4 + 2] = a;
  dst[x * 4 + 3] = a;
} else {
  dst[x * 4]     = b;
  dst[x * 4 + 1] = b;
  dst[x * 4 + 2] = b;
  dst[x * 4 + 3] = b;
}
NEON implementation
// dst[x * 4]     = (a & Eflag) | (b & ~Eflag);
// dst[x * 4 + 1] = (a & Eflag) | (b & ~Eflag);
// dst[x * 4 + 2] = (a & Eflag) | (b & ~Eflag);
// dst[x * 4 + 3] = (a & Eflag) | (b & ~Eflag);
VBSL qFlag, qA, qB
The ARM NEON instruction set provides instructions to implement the logical operation above: VCEQ/VCGE/VCGT (comparisons that produce per-lane all-ones/all-zeros masks), VBSL/VBIT/VBIF (bitwise select), and VAND/VORR/VEOR (bitwise logic). A sketch with intrinsics follows.
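As a minimal sketch of the same idea with intrinsics (our own example; the function name and the assumption that each mask lane is all-ones or all-zeros are ours), vbslq_u32 computes (a & mask) | (b & ~mask) per lane without any branch:

#include <arm_neon.h>

// Branch-free select: dst[i] = mask[i] ? a : b, four lanes at a time.
// Each lane of mask must be all-ones or all-zeros (e.g. from a VCEQ/VCGT).
// count is assumed to be a multiple of 4 for brevity.
void select_u32(uint32_t* dst, const uint32_t* mask,
                uint32_t a, uint32_t b, int count) {
  uint32x4_t va = vdupq_n_u32(a);
  uint32x4_t vb = vdupq_n_u32(b);
  for (int i = 0; i < count; i += 4) {
    uint32x4_t m = vld1q_u32(mask + i);
    vst1q_u32(dst + i, vbslq_u32(m, va, vb));  // (va & m) | (vb & ~m)
  }
}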
Reducing branches is not specific to NEON; it is a commonly used trick. Even in C code, it is worth the effort.
ARM processors use a load/store architecture: except for load/store instructions, all operations work on registers. Therefore, increasing the efficiency of load/store instructions is very important when optimizing applications.
The preload instruction (PLD) allows the processor to signal to the memory system that a load from an address is likely in the near future. If the data is preloaded into the cache correctly, it improves the cache hit rate, which can boost performance significantly. But preload is not a panacea: it is very hard to use well on recent processors, and a bad preload will actually reduce performance.
PLD syntax:
PLD{cond} [Rn {, #offset}]
PLD{cond} [Rn, +/-Rm {, shift}]
PLD{cond} label
Where:
cond - is an optional condition code.
Rn - is the register on which the memory address is based.
offset - is an immediate offset. If offset is omitted, the address is the value in Rn.
Rm - contains an offset value and must not be PC (or SP, in Thumb state).
shift - is an optional shift.
label - is a PC-relative expression.
The PLD operation is only a hint to the memory system: the processor continues executing subsequent instructions while the data is fetched, and if the address is invalid or cannot be preloaded, PLD behaves as a NOP rather than faulting.
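As a hedged illustration (our own AArch32 inline-assembly sketch, not from the original article; the #192 prefetch distance is a tuning parameter that must be benchmarked per micro-architecture, and count is assumed to be a multiple of 8):

// Copy loop that preloads ahead of the current load position.
void copy_f32(float* dst, const float* src, int count) {
  asm volatile (
    "1:                                \n"
    "  pld     [%[src], #192]          \n"  // hint: prefetch a future line
    "  vld1.32 {q0, q1}, [%[src]]!     \n"  // load 8 floats
    "  subs    %[count], %[count], #8  \n"
    "  vst1.32 {q0, q1}, [%[dst]]!     \n"  // store 8 floats
    "  bgt     1b                      \n"
    : [dst] "+r"(dst), [src] "+r"(src), [count] "+r"(count)
    :
    : "memory", "cc", "q0", "q1");
}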
In ARM NEON programming, different instruction sequences can perform the same operation, but fewer instructions do not always produce better performance; it depends on the benchmark and profiling results of the specific case. Listed below are some special cases from development practice.
This example is specific to Cortex-A9. For other platforms, the result needs to be verified again.
Usually, VMUL+VADD/VMUL+VSUB can be replaced by VMLA/VMLS because fewer instructions are used. But compared with floating-point VMUL, floating-point VMLA/VMLS has a longer result latency. If no other instructions can be scheduled into that latency window, using floating-point VMUL+VADD/VMUL+VSUB shows better performance.
A real-world example is the floating-point FIR function in Ne10. The code snippets are as follows.
Implementation 1: there is only one instruction (VEXT) between two VMLAs, but VMLA needs 9 execution cycles according to the NEON floating-point instruction timing table.
VEXT qTemp1, qInp, qTemp, #1
VMLA qAcc0, qInp, dCoeff_0[0]
VEXT qTemp2, qInp, qTemp, #2
VMLA qAcc0, qTemp1, dCoeff_0[1]
VEXT qTemp3, qInp, qTemp, #3
VMLA qAcc0, qTemp2, dCoeff_1[0]
VMLA qAcc0, qTemp3, dCoeff_1[1]
Implementation 2: there is still a data dependency on qAcc0, but VMUL/VADD need only 5 execution cycles each.
VEXT qTemp1, qInp, qTemp, #1
VMLA qAcc0, qInp, dCoeff_0[0]
VMUL qAcc1, qTemp1, dCoeff_0[1]
VEXT qTemp2, qInp, qTemp, #2
VMUL qAcc2, qTemp2, dCoeff_1[0]
VADD qAcc0, qAcc0, qAcc1
VEXT qTemp3, qInp, qTemp, #3
VMUL qAcc3, qTemp3, dCoeff_1[1]
VADD qAcc0, qAcc0, qAcc2
VADD qAcc0, qAcc0, qAcc3
The benchmark shows that implementation 2 has a better performance.
For more code details, see GitHub.
The NEON floating-point instruction timings used above are taken from appendix [ii].
NEON optimization techniques are summarized as follows:
- Reduce data dependencies between neighboring instructions.
- Replace branches with logical operations such as bitwise select where possible.
- Use the preload instruction (PLD) carefully to improve the cache hit rate.
- Do not assume fewer instructions are faster; benchmark alternative instruction sequences on the target micro-architecture.
In "ARM NEON programming quick reference", there is a simple comparison of the pros and cons of NEON assembly and intrinsics.
But the reality is far more complex than that, especially when targeting both ARMv7-A and ARMv8-A. In the following sections, this issue is analyzed further with some examples.
For NEON beginners, intrinsics are easier to use than assembly. But experienced developers may be more familiar with NEON assembly programming and need time to adapt to coding with intrinsics. Some issues that occur in real development are described below.
From the perspective of instruction usage, assembly is more flexible than intrinsics, mainly in data loads and stores.
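For example (our own illustration, not from the original article), AArch32 assembly can combine an alignment hint with post-increment by a register in a single load, a form that classic NEON intrinsics do not expose:

VLD1.32 {d0, d1}, [r0:128], r2   @ 128-bit-aligned load; then r0 += r2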
This issue should be fixed by compiler upgrades over time. In some cases, the compiler is already able to translate two intrinsics into one assembly instruction; for example, a vmulq_f32 followed by a vaddq_f32 can be combined into a single VMLA.F32.
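As a hedged illustration (our own example): with optimization enabled, a compiler may emit a single multiply-accumulate for the function below, because NEON's VMLA.F32 rounds after the multiply exactly as the separate VMUL + VADD pair does, so the combination preserves the result:

#include <arm_neon.h>

// Two intrinsics that the compiler may combine into one VMLA.F32.
float32x4_t muladd_f32(float32x4_t acc, float32x4_t a, float32x4_t b) {
  return vaddq_f32(acc, vmulq_f32(a, b));
}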
Therefore, it is expected that intrinsics will gain the same flexibility as assembly instructions as the ARMv8 toolchain matures.
When programming in NEON assembly, registers have to be allocated by the user, who must know exactly which registers are occupied. One benefit of programming with intrinsics is that users only define variables and the compiler allocates registers automatically. This is an advantage, but it can become a weakness. Practice has shown that using too many NEON registers simultaneously in intrinsics code can trigger register-allocation problems in GCC: data is spilled to the stack, which greatly hurts performance. Users should keep this in mind when programming with intrinsics. When performance is unexpectedly poor (for example, the C version outperforms the NEON version), first check the disassembly for register spilling. On ARMv8-A AArch64 there are more registers (32 128-bit NEON registers), so the impact of this issue is significantly reduced.
On a given platform, the performance of NEON assembly is decided only by the implementation and has nothing to do with the compiler. The benefit is that you can predict and control the performance of the program as you hand-tune the code, but there are no surprises.
Conversely, the performance of NEON intrinsics depends greatly on the compiler used; different compilers can produce very different performance, and in general the older the compiler, the worse the result. When compatibility with older compilers is needed, consider carefully whether intrinsics will meet the requirements. In addition, when fine-tuning the code it is hard to predict how performance will change, because the compiler intervenes. But there can be pleasant surprises: occasionally intrinsics perform better than assembly. This is rare, but it does happen.
The compiler also has an impact on the process of NEON optimization. The following figure describes the general process of NEON implementation and optimization.
NEON assembly and intrinsics share the same implementation process: coding, debugging, and performance testing. But their optimization steps differ. When fine-tuning assembly, a sophisticated approach is to reschedule the instructions by hand according to the instruction timings and pipeline behavior of the target micro-architecture. The disadvantage of this approach is that the changes are specific to one micro-architecture: when the platform changes, the performance improvement may be lost. It is also very time-consuming for often comparatively small gains. Fine-tuning NEON intrinsics is more difficult still, because the compiler stands between the source code and the generated instructions.
When porting ARMv7-A assembly code to intrinsics for ARMv7-A/v8-A compatibility, the performance of the original assembly can be used as a reference, so it is easy to check whether the port is good enough. However, when intrinsics are used to optimize code for ARMv8-A, there is no such reference, and it is difficult to determine whether the performance is optimal. Based on the experience on ARMv7-A, there may be a lingering doubt about whether assembly would perform better. I think the impact of this issue will shrink as the ARMv8-A environment matures.
Today, most existing NEON assembly code can only run on ARMv7-A or in ARMv8-A AArch32 mode. To run it in ARMv8-A AArch64 mode, the code must be rewritten, which takes a lot of work. If the code is written with NEON intrinsics instead, it runs directly in AArch64 mode: cross-platform support is one of the great advantages of intrinsics. Meanwhile, you only need to maintain one set of code for the different platforms, which significantly reduces the maintenance effort. However, because ARMv7-A and ARMv8-A provide different hardware resources, sometimes two sets of code are still needed even with intrinsics. The FFT implementation in the Ne10 project is an example:
// radix-4 butterfly with twiddles
scratch[0].r = scratch_in[0].r;
scratch[0].i = scratch_in[0].i;
scratch[1].r = scratch_in[1].r * scratch_tw[0].r - scratch_in[1].i * scratch_tw[0].i;
scratch[1].i = scratch_in[1].i * scratch_tw[0].r + scratch_in[1].r * scratch_tw[0].i;
scratch[2].r = scratch_in[2].r * scratch_tw[1].r - scratch_in[2].i * scratch_tw[1].i;
scratch[2].i = scratch_in[2].i * scratch_tw[1].r + scratch_in[2].r * scratch_tw[1].i;
scratch[3].r = scratch_in[3].r * scratch_tw[2].r - scratch_in[3].i * scratch_tw[2].i;
scratch[3].i = scratch_in[3].i * scratch_tw[2].r + scratch_in[3].r * scratch_tw[2].i;
The code snippet above shows the basic element of an FFT: the radix-4 butterfly. From the code, the following can be concluded:
- One radix-4 butterfly reads 4 complex inputs and 3 complex twiddle factors and writes 4 complex outputs, so many values must be held live at the same time.
- Keeping all of these values in registers consumes a large part of the NEON register file.
And the NEON register files of ARMv7-A/v8-A AArch32 and v8-A AArch64 differ: AArch32 provides 16 128-bit Q registers, while AArch64 provides 32 128-bit registers.
Considering the above factors, the FFT implementation of Ne10 ended up with an assembly version for ARMv7-A/v8-A AArch32, in which 2 radix-4 butterflies are executed per loop iteration, and an intrinsics version for ARMv8-A AArch64, in which 4 radix-4 butterflies are executed per loop iteration. This example illustrates that some exceptions still need attention when maintaining one set of code across ARMv7-A/v8-A platforms.
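As an illustration of how such a butterfly stage maps to intrinsics (a sketch under our own naming and a planar data layout; this is not the actual Ne10 code), the complex multiply by a twiddle factor for four butterflies at once might look like:

#include <arm_neon.h>

// Four complex multiplies in parallel: out = in * tw. The real and
// imaginary parts are de-interleaved into separate vectors.
typedef struct { float32x4_t r; float32x4_t i; } cplx4_t;

static inline cplx4_t cplx4_mul(cplx4_t in, cplx4_t tw) {
  cplx4_t out;
  out.r = vmlsq_f32(vmulq_f32(in.r, tw.r), in.i, tw.i);  // ar*br - ai*bi
  out.i = vmlaq_f32(vmulq_f32(in.i, tw.r), in.r, tw.i);  // ai*br + ar*bi
  return out;
}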
Many issues with NEON assembly and intrinsics have been discussed, but these issues are temporary. In the long term, intrinsics will be the better choice. By using intrinsics, you can reap the benefits of hardware and compiler upgrades without reprogramming; some classical algorithms only need to be implemented once.
The compiler will adjust such code for new hardware, which reduces the workload significantly. pffft is an example. The following figure describes the performance of pffft and the Ne10 real FFT on an ARM Cortex-A9 platform with GCC. The X-axis is the FFT length; the Y-axis is the execution time (smaller is better). pffft is implemented with NEON intrinsics; the Ne10 real FFT is implemented with NEON assembly. They do not use the same algorithm, but they achieve similar performance.