NEON Intrinsics Performance

RW-sec over 4 years ago

Hi ARM-Support,

I compared the performance of plain C code and C code with NEON intrinsics on my Raspberry Pi 4 and was surprised that the plain C code is slightly faster. In both cases the code performs a series of XOR operations (over 200 bytes in the plain C version, over 400 bytes in the NEON version). From the assembly generated by GCC, my impression was that the NEON code should be a little faster, as it has fewer instructions.

The measurements were taken with CNTPCT_EL0, the Counter-timer Physical Count register.
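For reference, a measurement of this kind could be sketched as follows. This is not the exact code used for the numbers in this post. It makes two assumptions that are labeled in the comments: it reads the virtual counter CNTVCT_EL0 instead of CNTPCT_EL0, because user-space reads of the physical counter may be trapped depending on how the OS configures the timer, and it falls back to `clock_gettime` on non-AArch64 hosts so the file still builds there. The helper names (`read_counter`, `counter_frequency_hz`, `min_ticks_per_batch`, `xor_workload`) are hypothetical.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

/* Read a monotonically increasing tick counter. */
static inline uint64_t read_counter(void)
{
#if defined(__aarch64__)
    uint64_t ticks;
    /* Assumption: CNTVCT_EL0 is read instead of CNTPCT_EL0, since EL0 access
       to the physical counter may be disabled. The isb keeps the counter
       read from being executed ahead of earlier instructions. */
    __asm__ volatile("isb\n\tmrs %0, cntvct_el0" : "=r"(ticks) : : "memory");
    return ticks;
#else
    /* Fallback for other hosts: nanoseconds from the monotonic clock. */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
#endif
}

/* Tick rate of read_counter() in Hz. */
static inline uint64_t counter_frequency_hz(void)
{
#if defined(__aarch64__)
    uint64_t hz;
    __asm__ volatile("mrs %0, cntfrq_el0" : "=r"(hz));
    return hz;
#else
    return 1000000000ULL; /* the fallback above counts nanoseconds */
#endif
}

/* Time `iters` back-to-back calls of fn(arg), repeat that `reps` times,
   and return the smallest tick count seen for one batch. */
static uint64_t min_ticks_per_batch(void (*fn)(void *), void *arg,
                                    unsigned iters, unsigned reps)
{
    uint64_t best = UINT64_MAX;
    for (unsigned r = 0; r < reps; r++) {
        uint64_t start = read_counter();
        for (unsigned i = 0; i < iters; i++)
            fn(arg);
        uint64_t elapsed = read_counter() - start;
        if (elapsed < best)
            best = elapsed;
    }
    return best;
}

/* Hypothetical workload: XOR every fifth of 25 64-bit lanes into lanes 0..4.
   The volatile pointer keeps the compiler from removing the work. */
static void xor_workload(void *arg)
{
    volatile uint64_t *s = (volatile uint64_t *)arg;
    for (int x = 0; x < 5; x++)
        s[x] = s[x] ^ s[x + 5] ^ s[x + 10] ^ s[x + 15] ^ s[x + 20];
}
```

Note that the generic timer ticks at the rate reported by CNTFRQ_EL0, not at the core clock, so one tick usually spans many CPU cycles. A routine as short as the ones below has to be called many times per batch (a large `iters`) before the difference between two variants becomes visible in the tick count.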

Do you have an idea why the NEON C code is slightly slower than the plain C code? And is it documented somewhere how many clock cycles the instructions need when NEON registers are used?

NEON code:

ldp	q0, q7, [x0]
ldp	q6, q5, [x0, #32]
ldp	q16, q4, [x0, #64]
ldp	q3, q2, [x0, #96]
ldr	q1, [x0, #128]
eor	v4.16b, v4.16b, v0.16b
eor	v1.16b, v1.16b, v5.16b
eor	v3.16b, v3.16b, v7.16b
eor	v2.16b, v2.16b, v6.16b
ldp	q0, q7, [x0, #144]
ldp	q6, q5, [x0, #176]
eor	v0.16b, v0.16b, v16.16b
eor	v4.16b, v4.16b, v7.16b
ldp	q16, q7, [x0, #208]
eor	v3.16b, v3.16b, v6.16b
eor	v2.16b, v2.16b, v5.16b
ldp	q6, q5, [x0, #240]
eor	v1.16b, v1.16b, v16.16b
eor	v0.16b, v0.16b, v7.16b
ldp	q16, q7, [x0, #272]
eor	v4.16b, v4.16b, v6.16b
eor	v3.16b, v3.16b, v5.16b
ldp	q6, q5, [x0, #304]
eor	v2.16b, v2.16b, v16.16b
eor	v1.16b, v1.16b, v7.16b
ldp	q16, q7, [x0, #336]
eor	v4.16b, v4.16b, v5.16b
eor	v0.16b, v0.16b, v6.16b
ldp	q6, q5, [x0, #368]
eor	v3.16b, v3.16b, v16.16b
eor	v2.16b, v2.16b, v7.16b
eor	v1.16b, v1.16b, v6.16b
eor	v0.16b, v0.16b, v5.16b
stp	q4, q3, [x0]
stp	q2, q1, [x0, #32]
str	q0, [x0, #64]
ret
nop
nop
nop

Plain C code:

ldp	x5, x4, [x0]
ldp	x1, x6, [x0, #32]
ldp	x7, x8, [x0, #48]
eor	x5, x5, x6
ldp	x3, x2, [x0, #16]
eor	x4, x4, x7
ldp	x6, x7, [x0, #64]
eor	x3, x3, x8
ldp	x11, x13, [x0, #120]
eor	x2, x2, x6
eor	x1, x1, x7
ldr	x6, [x0, #88]
ldr	x9, [x0, #96]
ldr	x12, [x0, #136]
eor	x6, x6, x13
ldr	x10, [x0, #80]
eor	x4, x4, x6
ldr	x6, [x0, #160]
eor	x9, x9, x12
eor	x3, x3, x9
ldr	x9, [x0, #168]
eor	x10, x10, x11
eor	x5, x5, x10
ldr	x8, [x0, #104]
ldp	x11, x10, [x0, #144]
eor	x5, x6, x5
eor	x4, x4, x9
ldr	x7, [x0, #112]
eor	x8, x8, x11
stp	x5, x4, [x0]
eor	x2, x2, x8
ldr	x8, [x0, #176]
eor	x7, x7, x10
ldr	x6, [x0, #184]
eor	x1, x1, x7
ldr	x5, [x0, #192]
eor	x3, x3, x8
eor	x2, x2, x6
eor	x1, x1, x5
stp	x3, x2, [x0, #16]
str	x1, [x0, #32]
ret
nop
nop
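The C source behind the two listings is not included in this post. Judging only from the load and store offsets, both routines appear to XOR every fifth element together and write five results back to the start of the buffer: 25 64-bit lanes (200 bytes) in the plain C case, 25 128-bit vectors (400 bytes) in the NEON case. The following is a hypothetical reconstruction under that reading, not the original source. The function names and the assumption that the 400-byte buffer holds two interleaved 200-byte states are mine; the scalar fallback in `fold_two_states` exists only so the file compiles on hosts without NEON.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Plain C: fold 25 64-bit lanes (200 bytes) into 5 column XORs,
   written back to s[0..4]. */
static void fold_scalar(uint64_t s[25])
{
    for (size_t x = 0; x < 5; x++)
        s[x] = s[x] ^ s[x + 5] ^ s[x + 10] ^ s[x + 15] ^ s[x + 20];
}

/* Two interleaved states (400 bytes): s[2*i] is lane i of state A and
   s[2*i + 1] is lane i of state B (assumed layout). Each 128-bit vector
   therefore carries the same lane of both states. */
static void fold_two_states(uint64_t s[50])
{
#if defined(__ARM_NEON)
    for (size_t x = 0; x < 5; x++) {
        uint64x2_t acc = vld1q_u64(&s[2 * x]);
        for (size_t y = 1; y < 5; y++)
            acc = veorq_u64(acc, vld1q_u64(&s[2 * (x + 5 * y)]));
        vst1q_u64(&s[2 * x], acc);
    }
#else
    /* Scalar fallback with the same result, for non-NEON hosts. */
    for (size_t i = 0; i < 10; i++)
        s[i] = s[i] ^ s[i + 10] ^ s[i + 20] ^ s[i + 30] ^ s[i + 40];
#endif
}
```

In both variants each result is a chain of four dependent XORs, and the five chains are independent of each other. Whether that independence is enough to cover the latency of the vector `eor` is the kind of question the per-instruction latency and throughput figures would answer.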

Best regards

Top replies

vstehle over 4 years ago +3 verified

Hi @RW-sec,

> ... is there somewhere documented how many clock cycles the instructions need if NEON registers are used?

The Cortex-A72 Software Optimization Guide has detailed throughput and latency...

RW-sec over 4 years ago in reply to vstehle

OK, I see. Can I rely on GCC/Clang to schedule the instructions so that this additional latency is hidden, or should I do this manually?