ARM NEON optimization

Yang Zhang 张洋
March 27, 2015

Welcome to the ARM NEON optimization guide!

1. Introduction

After reading the article ARM NEON programming quick reference, you should have a basic understanding of ARM NEON programming. But applying NEON to real-world applications takes a number of practical programming techniques. This article introduces some common NEON optimization techniques drawn from development practice, and also discusses the trade-offs between NEON assembly and intrinsics.

2. NEON optimization skills

The following techniques are commonly used when optimizing applications with NEON.

2.1 Remove data dependencies

On the ARMv7-A platform, NEON instructions usually take more cycles than ARM instructions. To hide this latency, avoid using the destination register of one instruction as a source register of the next instruction.

Example:

C code:

float SumSquareError_C(const float* src_a, const float* src_b, int count)
{
  float sse = 0.0f;
  int i;
  for (i = 0; i < count; ++i) {
    float diff = src_a[i] - src_b[i];
    sse += (float)(diff * diff);
  }
  return sse;
}

NEON implementation 1

float SumSquareError_NEON1(const float* src_a, const float* src_b, int count)
{
  // The loop consumes 16 floats per iteration, so count must be a
  // positive multiple of 16.
  float sse;
  asm volatile (
    // Clear q8, q9, q10, q11
    "veor       q8, q8, q8                  \n"
    "veor       q9, q9, q9                  \n"
    "veor       q10, q10, q10               \n"
    "veor       q11, q11, q11               \n"
  "1:                                       \n"
    "vld1.32    {q0, q1}, [%[src_a]]!       \n"
    "vld1.32    {q2, q3}, [%[src_a]]!       \n"
    "vld1.32    {q12, q13}, [%[src_b]]!     \n"
    "vld1.32    {q14, q15}, [%[src_b]]!     \n"
    "subs       %[count], %[count], #16     \n"
    // q0, q1, q2, q3 are the destinations of vsub.
    // They are also the sources of the vmla that follows immediately.
    "vsub.f32   q0, q0, q12                 \n"
    "vmla.f32   q8, q0, q0                  \n"
    "vsub.f32   q1, q1, q13                 \n"
    "vmla.f32   q9, q1, q1                  \n"
    "vsub.f32   q2, q2, q14                 \n"
    "vmla.f32   q10, q2, q2                 \n"
    "vsub.f32   q3, q3, q15                 \n"
    "vmla.f32   q11, q3, q3                 \n"
    "bgt        1b                          \n"
    // Reduce the four accumulators to a single float.
    "vadd.f32   q8, q8, q9                  \n"
    "vadd.f32   q10, q10, q11               \n"
    "vadd.f32   q11, q8, q10                \n"
    "vpadd.f32  d2, d22, d23                \n"
    "vpadd.f32  d0, d2, d2                  \n"
    "vmov.32    %[sse], d0[0]               \n"
    : [src_a] "+r"(src_a),
      [src_b] "+r"(src_b),
      [count] "+r"(count),
      [sse] "=r"(sse)
    :
    : "memory", "cc", "q0", "q1", "q2", "q3", "q8", "q9", "q10", "q11",
      "q12", "q13", "q14", "q15");
  return sse;
}

NEON implementation 2

float SumSquareError_NEON2(const float* src_a, const float* src_b, int count)
{
  // The loop consumes 16 floats per iteration, so count must be a
  // positive multiple of 16.
  float sse;
  asm volatile (
    // Clear q8, q9, q10, q11
    "veor       q8, q8, q8                  \n"
    "veor       q9, q9, q9                  \n"
    "veor       q10, q10, q10               \n"
    "veor       q11, q11, q11               \n"
  "1:                                       \n"
    "vld1.32    {q0, q1}, [%[src_a]]!       \n"
    "vld1.32    {q2, q3}, [%[src_a]]!       \n"
    "vld1.32    {q12, q13}, [%[src_b]]!     \n"
    "vld1.32    {q14, q15}, [%[src_b]]!     \n"
    "subs       %[count], %[count], #16     \n"
    // All four vsub are issued before the first vmla needs a result.
    "vsub.f32   q0, q0, q12                 \n"
    "vsub.f32   q1, q1, q13                 \n"
    "vsub.f32   q2, q2, q14                 \n"
    "vsub.f32   q3, q3, q15                 \n"
    "vmla.f32   q8, q0, q0                  \n"
    "vmla.f32   q9, q1, q1                  \n"
    "vmla.f32   q10, q2, q2                 \n"
    "vmla.f32   q11, q3, q3                 \n"
    "bgt        1b                          \n"
    // Reduce the four accumulators to a single float.
    "vadd.f32   q8, q8, q9                  \n"
    "vadd.f32   q10, q10, q11               \n"
    "vadd.f32   q11, q8, q10                \n"
    "vpadd.f32  d2, d22, d23                \n"
    "vpadd.f32  d0, d2, d2                  \n"
    "vmov.32    %[sse], d0[0]               \n"
    : [src_a] "+r"(src_a),
      [src_b] "+r"(src_b),
      [count] "+r"(count),
      [sse] "=r"(sse)
    :
    : "memory", "cc", "q0", "q1", "q2", "q3", "q8", "q9", "q10", "q11",
      "q12", "q13", "q14", "q15");
  return sse;
}

In NEON implementation 1, the destination register of each VSUB is used immediately as the source register of the following VMLA. In NEON implementation 2, the instructions are rescheduled so that each result has as much time as possible to become available before it is used. The test result indicates that implementation 2 is ~30% faster than implementation 1. Thus, reducing data dependencies can improve performance significantly. The good news is that the compiler can schedule NEON intrinsics automatically to avoid such data dependencies, which is one of the big advantages of intrinsics.

Note: this test runs on Cortex-A9. The result may be different on other platforms.
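As an illustration of that last point, the same computation might be written with intrinsics roughly as follows. This is a minimal sketch, not code from the original test: the function name SumSquareError_Sketch is hypothetical, count is assumed to be a positive multiple of 16 as in the assembly above, and a scalar fallback with the same four-accumulator structure is included so that the code also builds on targets without NEON.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper; assumes count is a positive multiple of 16. */
float SumSquareError_Sketch(const float *src_a, const float *src_b, int count)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Four independent accumulators, like q8..q11 in the assembly. */
    float32x4_t acc0 = vdupq_n_f32(0.0f);
    float32x4_t acc1 = vdupq_n_f32(0.0f);
    float32x4_t acc2 = vdupq_n_f32(0.0f);
    float32x4_t acc3 = vdupq_n_f32(0.0f);
    for (int i = 0; i < count; i += 16) {
        /* The four differences do not depend on each other, so the
         * compiler is free to interleave them with the accumulations. */
        float32x4_t d0 = vsubq_f32(vld1q_f32(src_a + i),      vld1q_f32(src_b + i));
        float32x4_t d1 = vsubq_f32(vld1q_f32(src_a + i + 4),  vld1q_f32(src_b + i + 4));
        float32x4_t d2 = vsubq_f32(vld1q_f32(src_a + i + 8),  vld1q_f32(src_b + i + 8));
        float32x4_t d3 = vsubq_f32(vld1q_f32(src_a + i + 12), vld1q_f32(src_b + i + 12));
        acc0 = vmlaq_f32(acc0, d0, d0);
        acc1 = vmlaq_f32(acc1, d1, d1);
        acc2 = vmlaq_f32(acc2, d2, d2);
        acc3 = vmlaq_f32(acc3, d3, d3);
    }
    /* Reduce the four vector accumulators to one float. */
    float32x4_t acc = vaddq_f32(vaddq_f32(acc0, acc1), vaddq_f32(acc2, acc3));
    float32x2_t sum2 = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    sum2 = vpadd_f32(sum2, sum2);
    return vget_lane_f32(sum2, 0);
#else
    /* Scalar fallback: the same idea with four independent accumulators. */
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    for (int i = 0; i < count; i += 4) {
        float d0 = src_a[i]     - src_b[i];
        float d1 = src_a[i + 1] - src_b[i + 1];
        float d2 = src_a[i + 2] - src_b[i + 2];
        float d3 = src_a[i + 3] - src_b[i + 3];
        acc0 += d0 * d0;
        acc1 += d1 * d1;
        acc2 += d2 * d2;
        acc3 += d3 * d3;
    }
    return (acc0 + acc1) + (acc2 + acc3);
#endif
}
```

Whether the generated code matches the hand-scheduled assembly depends on the compiler and its version, so the disassembly is still worth checking.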

2.2 Reduce branches

The NEON instruction set has no branch instructions. When a branch is needed, the ARM branch instructions are used. ARM processors rely heavily on branch prediction, but when a prediction fails the penalty is rather high. So it’s better to avoid branch instructions where possible. In some cases, logical operations can replace a branch.

Example:

C implementation

if (flag)
{
        dst[x * 4]     = a;
        dst[x * 4 + 1] = a;
        dst[x * 4 + 2] = a;
        dst[x * 4 + 3] = a;
}
else
{
        dst[x * 4]     = b;
        dst[x * 4 + 1] = b;
        dst[x * 4 + 2] = b;
        dst[x * 4 + 3] = b;
}

NEON implementation

// dst[x * 4]     = (a & Eflag) | (b & ~Eflag);
// dst[x * 4 + 1] = (a & Eflag) | (b & ~Eflag);
// dst[x * 4 + 2] = (a & Eflag) | (b & ~Eflag);
// dst[x * 4 + 3] = (a & Eflag) | (b & ~Eflag);
VBSL qFlag, qA, qB

The ARM NEON instruction set provides the following instructions to help implement the logical operation above:

  • VCEQ, VCGE, VCGT, VCLE, VCLT
  • VBIT, VBIF, VBSL

Reducing branches is not specific to NEON. It is a commonly used trick, and it is worth the effort even in a plain C program.
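In plain C, the same select can be sketched as below. The helper names ExpandFlag, BitSelect and Fill4_Branchless are hypothetical, and the element type is assumed to be uint32_t because the original snippet does not state it. ExpandFlag plays the role of Eflag in the comments above: it widens the flag to an all-ones or all-zeros mask.

```c
#include <stdint.h>

/* Widen a boolean flag to an all-ones (true) or all-zeros (false) mask. */
static uint32_t ExpandFlag(int flag)
{
    return (uint32_t)0 - (uint32_t)(flag != 0);
}

/* Take each bit from a where mask is 1 and from b where mask is 0. */
static uint32_t BitSelect(uint32_t mask, uint32_t a, uint32_t b)
{
    return (a & mask) | (b & ~mask);
}

/* Branch-free version of the if/else example (hypothetical name). */
void Fill4_Branchless(uint32_t *dst, int x, int flag, uint32_t a, uint32_t b)
{
    uint32_t value = BitSelect(ExpandFlag(flag), a, b);
    dst[x * 4]     = value;
    dst[x * 4 + 1] = value;
    dst[x * 4 + 2] = value;
    dst[x * 4 + 3] = value;
}
```

A compiler may already turn a simple if/else like this into a conditional select on its own, so the benefit is worth confirming with a benchmark.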

2.3 Preload data: PLD [i]

ARM is a load/store architecture: except for load/store instructions, all operations work on registers. Therefore, making load/store instructions more efficient is very important when optimizing an application.

The preload instruction allows the processor to signal to the memory system that a load from an address is likely in the near future. If the data is preloaded into the cache in time, the cache hit rate improves, which can boost performance significantly. But preload is not a panacea. It is very hard to use well on recent processors, and it can be harmful too: a bad preload will reduce performance.

PLD syntax:

    PLD{cond} [Rn {, #offset}]

    PLD{cond} [Rn, +/-Rm {, shift}]

    PLD{cond} label

Where:

cond - is an optional condition code.
Rn - is the register on which the memory address is based.
offset - is an immediate offset. If offset is omitted, the address is the value in Rn.
Rm - contains an offset value and must not be PC (or SP, in Thumb state).
shift - is an optional shift.
label - is a PC-relative expression.

The PLD operation features:

  • It is independent of load and store instruction execution.
  • It happens in the background while the processor continues to execute other instructions.
  • The offset has to be chosen for the specific use case.
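From C, a preload hint can be requested with the GCC/Clang builtin __builtin_prefetch, which on ARM targets is typically lowered to a PLD instruction. The sketch below is only an illustration: the function name SumWithPrefetch, the block size of 16 ints and the value of PREFETCH_DISTANCE are assumptions, and the distance in particular has to be tuned by benchmarking on the target.

```c
#include <stddef.h>

/* Hypothetical tuning values, not taken from the original article. */
#define BLOCK_ELEMS        16   /* ints processed per prefetch hint  */
#define PREFETCH_DISTANCE  64   /* how many ints ahead to prefetch   */

long SumWithPrefetch(const int *data, size_t n)
{
    long sum = 0;
    size_t i = 0;
    for (; i + BLOCK_ELEMS <= n; i += BLOCK_ELEMS) {
#if defined(__GNUC__)
        /* Hint only: request data for a later iteration (read access,
         * high temporal locality). Stay inside the array bounds. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 3);
#endif
        for (size_t j = 0; j < BLOCK_ELEMS; ++j)
            sum += data[i + j];
    }
    /* Remaining elements that do not fill a whole block. */
    for (; i < n; ++i)
        sum += data[i];
    return sum;
}
```

The hint never changes the result, only the timing, so a version with and without it can be compared directly in a benchmark.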

2.4 Misc

In ARM NEON programming, different instruction sequences can be used to perform the same operation. But fewer instructions do not always produce better performance. It depends on the benchmark and profiling results of the specific case. Some special cases from development practice are listed below.

2.4.1 Floating-point VMLA/VMLS instruction

This example is specific for Cortex-A9. For other platforms, the result needs to be verified again.

Usually, VMUL+VADD/VMUL+VSUB can be replaced by VMLA/VMLS because fewer instructions are used. But compared to floating-point VMUL, floating-point VMLA/VMLS has a longer result latency. If no other instructions can be scheduled to fill that latency, using floating-point VMUL+VADD/VMUL+VSUB gives better performance.

A real-world example is floating-point FIR function in Ne10. The code snippets are as follows:

Implementation 1: there is only one instruction, VEXT, between two VMLA instructions that both depend on qAcc0, and each VMLA needs 9 execution cycles according to the table of NEON floating-point instruction timing.

VEXT qTemp1,qInp,qTemp,#1
VMLA qAcc0,qInp,dCoeff_0[0]
VEXT qTemp2,qInp,qTemp,#2
VMLA qAcc0,qTemp1,dCoeff_0[1]
VEXT qTemp3,qInp,qTemp,#3
VMLA qAcc0,qTemp2,dCoeff_1[0]
VMLA qAcc0,qTemp3,dCoeff_1[1]

Implementation 2: there is still data dependency on qAcc0. But VADD/VMUL needs 5 execution cycles only.

VEXT qTemp1,qInp,qTemp,#1
VMLA qAcc0,qInp,dCoeff_0[0]
VMUL qAcc1,qTemp1,dCoeff_0[1]
VEXT qTemp2,qInp,qTemp,#2
VMUL qAcc2,qTemp2,dCoeff_1[0]
VADD qAcc0, qAcc0, qAcc1
VEXT qTemp3,qInp,qTemp,#3
VMUL qAcc3,qTemp3,dCoeff_1[1]
VADD qAcc0, qAcc0, qAcc2
VADD qAcc0, qAcc0, qAcc3

The benchmark shows that implementation 2 has a better performance.

For more code details, see GitHub.

NEON floating-point instructions timing:

Name             | Format   | Cycles | Result
VADD, VSUB, VMUL | Dd,Dn,Dm | 1      | 5
VADD, VSUB, VMUL | Qd,Qn,Dm | 2      | 5
VMLA, VMLS       | Dd,Dn,Dm | 1      | 9
VMLA, VMLS       | Qd,Qn,Dm | 2      | 9

The table is from reference [ii].

Where:

  • Cycles: the number of issue cycles of the instruction
  • Result: the number of execution cycles until the result is available

2.5 Summary

NEON optimization techniques are summarized as follows:

  • Fill instruction latency with independent instructions as much as possible.
  • Avoid branches.
  • Pay attention to the cache hit rate.

3. NEON assembly and intrinsics

The article ARM NEON programming quick reference gives a simple comparison of the pros and cons of NEON assembly and intrinsics:

                | NEON assembly | NEON intrinsics
Performance     | Always shows the best performance for the specified platform for an experienced developer. | Depends heavily on the toolchain that is used.
Portability     | The different ISAs (i.e. ARMv7-A/v8-A AArch32 and ARMv8-A AArch64) have different assembly implementations. Even for the same ISA, the assembly might need to be fine-tuned to achieve ideal performance between different micro-architectures. | Program once and run on different ISAs. The compiler may also grant performance fine-tuning for different micro-architectures.
Maintainability | Hard to read/write compared with C. | Similar to C code, it’s easy to read/write.

But the reality is far more complex than this, especially when it comes to ARMv7-A/v8-A cross-platform development. In the following sections, this issue will be analyzed further with some examples.

3.1 Programming

For NEON beginners, intrinsics are easier to use than assembly. But experienced developers may be more familiar with NEON assembly programming and need time to adapt to coding with intrinsics. Some issues that may occur in real development are described below.

3.1.1 Flexibility of instruction

In terms of the instructions available, assembly is more flexible than intrinsics. This shows mainly in data load/store.

Example:

Intrinsics:
    Load data into a 64-bit register: vld1_s8/u8/s16/u16/s32…etc
    Load data into a 128-bit register: vld1q_s8/u8/s16/u16/s32…etc
ARMv7-A assembly:
    VLD1 { Dd }, [Rn]
    VLD1 { Dd, D(d+1) }, [Rn]
    VLD1 { Dd, D(d+1), D(d+2) }, [Rn]
    VLD1 { Dd, D(d+1), D(d+2), D(d+3) }, [Rn]
ARMv8-A assembly:
    LD1 { <Vt>.<T> }, [<Xn|SP>]
    LD1 { <Vt>.<T>, <Vt2>.<T> }, [<Xn|SP>]
    LD1 { <Vt>.<T>, <Vt2>.<T>, <Vt3>.<T> }, [<Xn|SP>]
    LD1 { <Vt>.<T>, <Vt2>.<T>, <Vt3>.<T>, <Vt4>.<T> }, [<Xn|SP>]

A single assembly instruction can load up to four registers, whereas each of the vld1 intrinsics listed above loads a single 64-bit or 128-bit register. This limitation should be addressed by future compiler upgrades. In some cases, the compiler is already able to translate two intrinsics into one assembly instruction, such as:

[Figure: compiler translating two intrinsics instructions into one assembly instruction]

Therefore, it’s expected that intrinsics will gain the same flexibility as assembly instructions as the ARMv8 toolchain is upgraded.

3.1.2 Register allocation

When programming in NEON assembly, registers have to be allocated by the programmer, who must know exactly which registers are occupied. One of the benefits of programming with intrinsics is that you only need to define variables; the compiler allocates registers automatically. This is an advantage, but it can be a weakness in some cases. Practice has shown that using too many NEON registers simultaneously in intrinsics code causes register allocation problems in the gcc compiler. When this happens, a lot of data is spilled to the stack, which greatly affects the performance of the program. Therefore, pay attention to this issue when programming with intrinsics. When performance is abnormal (for example, the C version is faster than the NEON version), first check the disassembly to see whether there is a register allocation problem. ARMv8-A AArch64 has more registers (32 128-bit NEON registers), so the impact of this issue is significantly reduced.

3.2 Performance and compiler

On a given platform, the performance of NEON assembly is decided only by the implementation and has nothing to do with the compiler. The benefit is that you can predict and control the performance of the program when you hand-tune the code, but there are no surprises either.

Conversely, the performance of NEON intrinsics depends greatly on the compiler used. Different compilers may give very different performance. In general, the older the compiler, the worse the performance. When compatibility with older compilers is needed, consider carefully whether intrinsics fit the need. In addition, when fine-tuning the code, it’s hard to predict how performance will change once the compiler intervenes. But there may be pleasant surprises: sometimes intrinsics give better performance than assembly. This is very rare, but it does occur.

The compiler has an impact on the process of NEON optimization. The following figure describes the general process of NEON implementation and optimization.

[Figure: process of NEON implementation and optimization]

NEON assembly and intrinsics share the same implementation process: coding, debugging, performance test. But they differ in the optimization step.

The methods of assembly fine-tuning are:

  • Change implementations, such as changing the instructions, adjusting the parallelism.
  • Adjust the instruction sequence to reduce data dependency.
  • Try the skills described in section 2.

A more sophisticated way of fine-tuning assembly is to:

  • Know exactly how many instructions are used.
  • Get the execution cycles of the program by using the PMU (Performance Monitoring Unit).
  • Adjust the sequence of instructions based on the timing of the instructions used, and try to minimize instruction latency as much as possible.

The disadvantage of this approach is that the changes are specific to one micro-architecture. When the platform is switched, the performance improvement achieved might be lost. It is also very time-consuming for what are often comparatively small gains.

Fine-tuning of NEON intrinsics is more difficult:

  • Try the methods used in NEON assembly optimization.
  • Look at the generated assembly and check the data dependencies and register usage.
  • Check whether the performance meets the expectation. If yes, the work of optimization is done. Then the performance with other compilers needs to be verified again.
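The "check whether the performance meets the expectation" step needs a repeatable measurement. Reading the PMU directly usually requires platform-specific setup, so the sketch below swaps in the standard C clock() function as a portable stand-in that reports CPU time rather than cycles. The names TimePerCall and ExampleWorkload are hypothetical.

```c
#include <time.h>

/* Hypothetical harness: average CPU seconds per call of fn(arg).
 * clock() is used here instead of the PMU cycle counter. */
double TimePerCall(void (*fn)(void *), void *arg, int iterations)
{
    clock_t start = clock();
    for (int i = 0; i < iterations; ++i)
        fn(arg);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC / iterations;
}

/* Placeholder workload: counts how many times it was called.
 * Replace it with a wrapper around the kernel being tuned. */
void ExampleWorkload(void *arg)
{
    int *calls = (int *)arg;
    *calls += 1;
}
```

Running the same harness on the C, intrinsics and assembly versions, and again after each compiler change, gives numbers that can be compared directly, provided the iteration count is large enough for the resolution of clock().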

When porting ARMv7-A assembly code to intrinsics for ARMv7-A/v8-A compatibility, the performance of the assembly can be used as a reference, so it is easy to check whether the work is done. However, when intrinsics are used to optimize ARMv8-A code, there is no such performance reference, and it is difficult to determine whether the performance is optimal. Based on experience with ARMv7-A, there may be doubt about whether assembly would perform better. I think the impact of this issue will become smaller and smaller as the ARMv8-A environment matures.

3.3 Cross-platform and portability

Currently, most existing NEON assembly code can only run on ARMv7-A/ARMv8-A AArch32 platforms. If you want to run it on ARMv8-A AArch64 platforms, you must rewrite the code, which takes a lot of work. In this situation, code written with NEON intrinsics can run directly on ARMv8-A AArch64 platforms. Being cross-platform is one of the great advantages of intrinsics. In addition, with intrinsics you only need to maintain one set of code for the different platforms, which significantly reduces the maintenance effort. However, because the hardware resources of ARMv7-A and ARMv8-A platforms differ, sometimes there might still be two sets of code even with intrinsics. The FFT implementation in the Ne10 project is an example:

// radix 4 butterfly with twiddles
scratch[0].r = scratch_in[0].r;
scratch[0].i = scratch_in[0].i;
scratch[1].r = scratch_in[1].r * scratch_tw[0].r - scratch_in[1].i * scratch_tw[0].i;
scratch[1].i = scratch_in[1].i * scratch_tw[0].r + scratch_in[1].r * scratch_tw[0].i;
scratch[2].r = scratch_in[2].r * scratch_tw[1].r - scratch_in[2].i * scratch_tw[1].i;
scratch[2].i = scratch_in[2].i * scratch_tw[1].r + scratch_in[2].r * scratch_tw[1].i;
scratch[3].r = scratch_in[3].r * scratch_tw[2].r - scratch_in[3].i * scratch_tw[2].i;
scratch[3].i = scratch_in[3].i * scratch_tw[2].r + scratch_in[3].r * scratch_tw[2].i;

The above code snippet shows the basic element of the FFT: the radix4 butterfly. From the code, the following can be concluded:

  • 20 64-bit NEON registers are needed if 2 radix4 butterflies are executed in one loop.
  • 20 128-bit NEON registers are needed if 4 radix4 butterflies are executed in one loop.

And, for ARMv7-A/v8-A AArch32 and v8-A AArch64,

  • There are 32 64-bit or 16 128-bit NEON registers for ARMv7-A/v8-A AArch32.
  • There are 32 128-bit NEON registers for ARMv8-A AArch64.

Considering the above factors, the FFT implementation of Ne10 eventually has an assembly version, in which 2 radix4 butterflies are executed in one loop, for ARMv7-A/v8-A AArch32, and an intrinsic version, in which 4 radix4 butterflies are executed in one loop, for ARMv8-A AArch64.

The above example illustrates that you need to pay attention to some exceptions when maintaining one set of code across ARMv7-A/v8-A platforms.
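To make the register counts quoted above easier to follow, here is a runnable scalar version of the twiddle stage. The type name Cpx and the function name Radix4Twiddle are hypothetical; Ne10 defines its own types. One plausible way to arrive at the figure of 20 (an interpretation, not stated in the original) is to count the live scalar values: 8 for the four complex inputs, 6 for the three complex twiddles and 6 for the three computed outputs, since scratch[0] is only a copy. Processing 2 butterflies side by side puts each value in a 2-lane 64-bit register, and processing 4 puts each value in a 4-lane 128-bit register.

```c
/* Hypothetical complex type for illustration only. */
typedef struct {
    float r;  /* real part      */
    float i;  /* imaginary part */
} Cpx;

/* Twiddle stage of one radix4 butterfly: scratch[0] is a plain copy,
 * scratch[1..3] are complex products of inputs and twiddles. */
void Radix4Twiddle(Cpx scratch[4], const Cpx scratch_in[4], const Cpx scratch_tw[3])
{
    scratch[0] = scratch_in[0];
    for (int k = 1; k < 4; ++k) {
        const Cpx in = scratch_in[k];
        const Cpx tw = scratch_tw[k - 1];
        scratch[k].r = in.r * tw.r - in.i * tw.i;
        scratch[k].i = in.i * tw.r + in.r * tw.i;
    }
}
```

With 16 128-bit registers on AArch32, 20 live 128-bit values cannot all stay in registers, which is consistent with the choice of 2 butterflies per loop there and 4 per loop on AArch64.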

3.4 Future

Many issues about using NEON assembly and intrinsics have been discussed. But these issues are temporary. In the long term, intrinsics will be better. By using intrinsics, you can reap the benefits of hardware and compiler upgrades without reprogramming. That means some classical algorithms only need to be implemented once.

The compiler will help to adjust the code for new hardware, which reduces the workload significantly. Pffft is an example.

The following figure describes the performance of pffft and Ne10 real FFT on the ARM Cortex-A9 platform with gcc. The X-axis represents the length of the FFT. The Y-axis represents the execution time; smaller is better. Pffft is implemented with NEON intrinsics. Ne10 real FFT is implemented with NEON assembly. They don’t use the same algorithm, but they have similar performance.

[Figure: Ne10 FFT vs pffft on Cortex-A9]

In the ARMv8-A AArch64 mode, the Ne10 real FFT is rewritten with both NEON assembly and intrinsics. Section 3.3 explained that ARMv8-A can process 4 butterflies in parallel, but ARMv7-A can only process 2 butterflies in parallel. So theoretically, FFT optimization should be more effective on ARMv8-A than on ARMv7-A. However, based on the following figure, it’s clear that pffft has the best performance. The result suggests that the compiler has done very good optimizations specific to the ARMv8 architecture.

[Figure: Ne10 FFT vs pffft on Arm Cortex-A53]

From this example, it can be concluded that pffft, whose performance isn’t the best on ARMv7-A, shows very good performance in the ARMv8-A AArch64 mode. That supports the point: the compiler can adjust intrinsics automatically for ARMv8-A to achieve good performance.

In the long run, existing NEON assembly code has to be rewritten for ARMv8. If NEON is upgraded in the future, the code has to be rewritten again and again. But NEON intrinsics code can be expected to show good performance on ARMv8-A with the help of compilers. Even if NEON is upgraded, you can look forward to the corresponding compiler upgrades.

3.5 Summary

In this section, the pros and cons of NEON assembly and intrinsics have been analyzed with some examples. The benefits of intrinsics far outweigh the drawbacks. Compared to assembly, intrinsics are easier to program and have better compatibility between ARMv7 and ARMv8.

Some tips for NEON intrinsics are summarized as follows:
  • Watch the number of registers used.
  • Pay attention to the compiler and its version.
  • Do look at the generated assembly.

4. End

This blog mainly discusses some common NEON optimization techniques and analyzes the pros and cons of NEON assembly and intrinsics with some examples. Hopefully these can help NEON developers in actual development.

References

[i] Cortex-A Series Programmer’s Guide, Version 2.0: 17-12 and 17-20
[ii] Cortex-A9 NEON Media Processing Engine Technical Reference Manual, Revision r4p1: 3.4.8
  • junaid shuja, over 8 years ago

    Hi, I am looking for FFT code that has neon intrinsics. I have looked into PFFFT but it does not have any ARM NEON intrinsics in code.

    Actually, I have code that translates NEON intrinsics to x86 SSE using gcc-x86 compiler and custom header files instead of arm_neon.h. But translating the Ne10 is not possible as it will contain code other than the NEON intrinsics that are specific to ARM architecture. Can you suggest me a FFT code with NEON intrinsics other than Ne10?
