Hi ARM-Support,
I compared the performance of plain C code and C code with NEON intrinsics on my Raspberry Pi 4 and was surprised that the plain C code is slightly faster. In both cases the code consists of a few XOR operations (over 200 bytes in the plain C code, 400 bytes in the NEON C code). Looking at the assembly generated by GCC, my impression was that the NEON code should be a little faster, as it has fewer instructions.
The measurements were taken with CNTPCT_EL0, the Counter-timer Physical Count register.
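For readers who want to reproduce this kind of measurement, here is a minimal sketch, not the code actually used for the numbers in this post. CNTPCT_EL0 counts at the fixed system-counter frequency reported by CNTFRQ_EL0 rather than at the core clock, so a single short call may span only a few ticks; the sketch therefore times a batch of calls and keeps the smallest result. The names `read_ticks`, `min_ticks`, `kernel_fn`, `count_call` and the macro `USE_CNTPCT_EL0` are made up for illustration. The register read is compiled only when that macro is defined, on the assumption that the platform permits reading the counter from user code; otherwise the sketch falls back to `clock_gettime`.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Returns a monotonically increasing tick count. */
static inline uint64_t read_ticks(void)
{
#if defined(__aarch64__) && defined(USE_CNTPCT_EL0)
    /* Assumption: the environment allows this register read. */
    uint64_t t;
    /* The isb keeps the counter read from being executed ahead of
       earlier instructions. */
    __asm__ volatile("isb\n\tmrs %0, cntpct_el0" : "=r"(t) : : "memory");
    return t;
#else
    /* Portable fallback: nanoseconds from the monotonic clock. */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

/* Hypothetical signature for the code under test. */
typedef void (*kernel_fn)(void *state);

/* Calls fn `calls` times between two counter reads, repeats that
   `trials` times, and returns the smallest tick delta observed. */
static uint64_t min_ticks(kernel_fn fn, void *state,
                          unsigned calls, unsigned trials)
{
    assert(calls > 0 && trials > 0);
    uint64_t best = UINT64_MAX;
    for (unsigned t = 0; t < trials; t++) {
        uint64_t start = read_ticks();
        for (unsigned i = 0; i < calls; i++)
            fn(state);
        uint64_t delta = read_ticks() - start;
        if (delta < best)
            best = delta;
    }
    return best;
}

/* Hypothetical stand-in kernel: it only counts its invocations. */
static void count_call(void *state)
{
    (*(unsigned long *)state)++;
}
```

Dividing the returned delta by `calls` gives ticks per call. Taking the minimum over several trials is one common way to reduce noise from interrupts and cold caches, though it is a design choice rather than the only option.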
Do you have any idea why the NEON C code is slightly slower than the plain C code, and is it documented anywhere how many clock cycles the instructions need when NEON registers are used?
NEON code:
    ldp q0, q7, [x0]
    ldp q6, q5, [x0, #32]
    ldp q16, q4, [x0, #64]
    ldp q3, q2, [x0, #96]
    ldr q1, [x0, #128]
    eor v4.16b, v4.16b, v0.16b
    eor v1.16b, v1.16b, v5.16b
    eor v3.16b, v3.16b, v7.16b
    eor v2.16b, v2.16b, v6.16b
    ldp q0, q7, [x0, #144]
    ldp q6, q5, [x0, #176]
    eor v0.16b, v0.16b, v16.16b
    eor v4.16b, v4.16b, v7.16b
    ldp q16, q7, [x0, #208]
    eor v3.16b, v3.16b, v6.16b
    eor v2.16b, v2.16b, v5.16b
    ldp q6, q5, [x0, #240]
    eor v1.16b, v1.16b, v16.16b
    eor v0.16b, v0.16b, v7.16b
    ldp q16, q7, [x0, #272]
    eor v4.16b, v4.16b, v6.16b
    eor v3.16b, v3.16b, v5.16b
    ldp q6, q5, [x0, #304]
    eor v2.16b, v2.16b, v16.16b
    eor v1.16b, v1.16b, v7.16b
    ldp q16, q7, [x0, #336]
    eor v4.16b, v4.16b, v5.16b
    eor v0.16b, v0.16b, v6.16b
    ldp q6, q5, [x0, #368]
    eor v3.16b, v3.16b, v16.16b
    eor v2.16b, v2.16b, v7.16b
    eor v1.16b, v1.16b, v6.16b
    eor v0.16b, v0.16b, v5.16b
    stp q4, q3, [x0]
    stp q2, q1, [x0, #32]
    str q0, [x0, #64]
    ret
    nop
    nop
    nop
C code:
    ldp x5, x4, [x0]
    ldp x1, x6, [x0, #32]
    ldp x7, x8, [x0, #48]
    eor x5, x5, x6
    ldp x3, x2, [x0, #16]
    eor x4, x4, x7
    ldp x6, x7, [x0, #64]
    eor x3, x3, x8
    ldp x11, x13, [x0, #120]
    eor x2, x2, x6
    eor x1, x1, x7
    ldr x6, [x0, #88]
    ldr x9, [x0, #96]
    ldr x12, [x0, #136]
    eor x6, x6, x13
    ldr x10, [x0, #80]
    eor x4, x4, x6
    ldr x6, [x0, #160]
    eor x9, x9, x12
    eor x3, x3, x9
    ldr x9, [x0, #168]
    eor x10, x10, x11
    eor x5, x5, x10
    ldr x8, [x0, #104]
    ldp x11, x10, [x0, #144]
    eor x5, x6, x5
    eor x4, x4, x9
    ldr x7, [x0, #112]
    eor x8, x8, x11
    stp x5, x4, [x0]
    eor x2, x2, x8
    ldr x8, [x0, #176]
    eor x7, x7, x10
    ldr x6, [x0, #184]
    eor x1, x1, x7
    ldr x5, [x0, #192]
    eor x3, x3, x8
    eor x2, x2, x6
    eor x1, x1, x5
    stp x3, x2, [x0, #16]
    str x1, [x0, #32]
    ret
    nop
    nop
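The C source behind the two listings is not shown in this post. As a rough reconstruction only, the loads and stores suggest something like the sketch below: 25 words are XORed column-wise into five results that are written back to the start of the buffer, once with 64-bit words (200 bytes) and once with 128-bit vectors (400 bytes). The function names and the exact data layout are guesses, and the non-Arm fallback exists only so the sketch builds on other hosts.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Scalar variant: 25 x 64-bit words (200 bytes). XOR the five rows
   column-wise and write the five results to words 0..4. */
static void xor_columns_scalar(uint64_t s[25])
{
    uint64_t c[5];
    for (int x = 0; x < 5; x++)
        c[x] = s[x] ^ s[x + 5] ^ s[x + 10] ^ s[x + 15] ^ s[x + 20];
    memcpy(s, c, sizeof c);
}

/* Two-lane variant: 25 x 128-bit vectors (400 bytes), treated here as
   two independent 200-byte inputs interleaved word by word. */
static void xor_columns_x2(uint64_t s[50])
{
#if defined(__ARM_NEON)
    uint64x2_t c[5];
    for (int x = 0; x < 5; x++) {
        c[x] = vld1q_u64(s + 2 * x);
        for (int y = 1; y < 5; y++)
            c[x] = veorq_u64(c[x], vld1q_u64(s + 2 * (x + 5 * y)));
    }
    for (int x = 0; x < 5; x++)
        vst1q_u64(s + 2 * x, c[x]);
#else
    /* Portable fallback with the same result, lane by lane. */
    uint64_t c[10];
    for (int x = 0; x < 5; x++)
        for (int lane = 0; lane < 2; lane++) {
            uint64_t acc = 0;
            for (int y = 0; y < 5; y++)
                acc ^= s[2 * (x + 5 * y) + lane];
            c[2 * x + lane] = acc;
        }
    memcpy(s, c, sizeof c);
#endif
}

/* Usage sketch: lane 0 of the interleaved buffer carries the scalar
   input, so its results must match the scalar variant. */
static void check_x2_against_scalar(void)
{
    uint64_t a[25], b[50];
    for (int i = 0; i < 25; i++) {
        a[i] = 0x0123456789abcdefULL * (uint64_t)(i + 1);
        b[2 * i] = a[i];      /* lane 0: same data as the scalar run */
        b[2 * i + 1] = ~a[i]; /* lane 1: unrelated second input */
    }
    xor_columns_scalar(a);
    xor_columns_x2(b);
    for (int x = 0; x < 5; x++)
        assert(b[2 * x] == a[x]);
}
```

If this reading is right, each NEON call handles twice as much data as the scalar call, which is worth keeping in mind when comparing the timings per call rather than per byte.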
Best regards
OK, I see. Can I rely on GCC/Clang to schedule the instructions so that this additional latency is hidden, or should I do that manually?