
NEON Intrinsics Performance

Hi ARM-Support,

I compared the performance of plain C code and C code with NEON intrinsics on my Raspberry Pi 4 and was surprised that the plain C code is slightly faster. In both cases the code performs a series of XOR operations (200 bytes in the plain C version, 400 bytes in the NEON version). Looking at the assembler code generated by GCC, my impression was that the NEON code should be a little faster, as it has fewer instructions.

The measurements were performed using CNTPCT_EL0, the Counter-timer Physical Count register.

Do you have an idea why the NEON C code is slightly slower than the pure C version? And is it documented anywhere how many clock cycles the instructions take when NEON registers are used?

NEON code:

ldp	q0, q7, [x0]
ldp	q6, q5, [x0, #32]
ldp	q16, q4, [x0, #64]
ldp	q3, q2, [x0, #96]
ldr	q1, [x0, #128]
eor	v4.16b, v4.16b, v0.16b
eor	v1.16b, v1.16b, v5.16b
eor	v3.16b, v3.16b, v7.16b
eor	v2.16b, v2.16b, v6.16b
ldp	q0, q7, [x0, #144]
ldp	q6, q5, [x0, #176]
eor	v0.16b, v0.16b, v16.16b
eor	v4.16b, v4.16b, v7.16b
ldp	q16, q7, [x0, #208]
eor	v3.16b, v3.16b, v6.16b
eor	v2.16b, v2.16b, v5.16b
ldp	q6, q5, [x0, #240]
eor	v1.16b, v1.16b, v16.16b
eor	v0.16b, v0.16b, v7.16b
ldp	q16, q7, [x0, #272]
eor	v4.16b, v4.16b, v6.16b
eor	v3.16b, v3.16b, v5.16b
ldp	q6, q5, [x0, #304]
eor	v2.16b, v2.16b, v16.16b
eor	v1.16b, v1.16b, v7.16b
ldp	q16, q7, [x0, #336]
eor	v4.16b, v4.16b, v5.16b
eor	v0.16b, v0.16b, v6.16b
ldp	q6, q5, [x0, #368]
eor	v3.16b, v3.16b, v16.16b
eor	v2.16b, v2.16b, v7.16b
eor	v1.16b, v1.16b, v6.16b
eor	v0.16b, v0.16b, v5.16b
stp	q4, q3, [x0]
stp	q2, q1, [x0, #32]
str	q0, [x0, #64]
ret
nop
nop
nop

C code:

ldp	x5, x4, [x0]
ldp	x1, x6, [x0, #32]
ldp	x7, x8, [x0, #48]
eor	x5, x5, x6
ldp	x3, x2, [x0, #16]
eor	x4, x4, x7
ldp	x6, x7, [x0, #64]
eor	x3, x3, x8
ldp	x11, x13, [x0, #120]
eor	x2, x2, x6
eor	x1, x1, x7
ldr	x6, [x0, #88]
ldr	x9, [x0, #96]
ldr	x12, [x0, #136]
eor	x6, x6, x13
ldr	x10, [x0, #80]
eor	x4, x4, x6
ldr	x6, [x0, #160]
eor	x9, x9, x12
eor	x3, x3, x9
ldr	x9, [x0, #168]
eor	x10, x10, x11
eor	x5, x5, x10
ldr	x8, [x0, #104]
ldp	x11, x10, [x0, #144]
eor	x5, x6, x5
eor	x4, x4, x9
ldr	x7, [x0, #112]
eor	x8, x8, x11
stp	x5, x4, [x0]
eor	x2, x2, x8
ldr	x8, [x0, #176]
eor	x7, x7, x10
ldr	x6, [x0, #184]
eor	x1, x1, x7
ldr	x5, [x0, #192]
eor	x3, x3, x8
eor	x2, x2, x6
eor	x1, x1, x5
stp	x3, x2, [x0, #16]
str	x1, [x0, #32]
ret
nop
nop

Best regards

  • Hi Vincent,

    Thank you for your answer. This is exactly what I was looking for.

    From the tables in the Cortex-A72 Software Optimization Guide it seems that the EOR (exclusive OR) operation has a latency of 1 when executed on the basic ALU (AArch64, page 8) and a latency of 3 as an ASIMD instruction (AArch64, page 24).
    So the main difference between the two code examples above is just the EOR operation, or did I miss something?
    That would explain why the NEON implementation is a little slower.
    The ASIMD LOAD/STORE operations (which are also somewhat slower) were not used, were they?

    Thank you for the document.

    Best regards

  • Considering EOR throughput, the Cortex-A72 should be able to perform up to 2 EOR operations each cycle, using either the 64-bit general-purpose registers or the 128-bit V registers with Advanced SIMD. This means that, with Advanced SIMD instructions, you could theoretically double the overall EOR throughput of your code.

    For that, you will need to "hide" the additional latency. On an out-of-order core such as the A72, putting more instructions "in flight" will help.

    In practice there might be other bottlenecks in your code besides the EOR. This is where profiling with the performance counters could help.

  • Ok, I see. Can I rely on GCC/Clang to schedule the instructions so that this additional latency is hidden, or should I do this manually?