Hi ARM-Support,
I compared the performance of plain C code and C code with NEON intrinsics on my Raspberry Pi 4 and was surprised that the plain C code is slightly faster. In both cases the code performs a series of XOR operations (on 200 bytes in the plain C version, on 400 bytes in the NEON version). Looking at the assembly generated by GCC, my impression was that the NEON code should be a little faster, as it has fewer instructions.
The measurements were taken with CNTPCT_EL0, the Counter-timer Physical Count register.
Do you have an idea why the NEON C code is slightly slower than the plain C code, and is it documented anywhere how many clock cycles the instructions need when NEON registers are used?
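For readers who want to reproduce this kind of measurement, here is a minimal sketch of reading the generic timer from C. It rests on several assumptions: it reads CNTVCT_EL0 (the virtual count) instead of CNTPCT_EL0, because Linux normally exposes only the virtual counter to user space; the helper names read_ticks and tick_frequency_hz are made up for illustration; and the non-AArch64 branch is just a fallback so the code builds elsewhere. Keep in mind that these counters tick at the fixed system-counter frequency reported by CNTFRQ_EL0, not at the CPU clock, so one tick may span many core cycles.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: read a free-running tick counter. */
static inline uint64_t read_ticks(void)
{
#if defined(__aarch64__)
    uint64_t t;
    /* ISB stops the counter read from being executed ahead of
       earlier instructions. CNTVCT_EL0 is an assumption; use
       CNTPCT_EL0 where your environment allows it. */
    __asm__ volatile("isb\n\tmrs %0, cntvct_el0" : "=r"(t) : : "memory");
    return t;
#else
    /* Fallback for other targets: wall-clock time in nanoseconds. */
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

/* Hypothetical helper: ticks per second of the counter above. */
static inline uint64_t tick_frequency_hz(void)
{
#if defined(__aarch64__)
    uint64_t f;
    __asm__ volatile("mrs %0, cntfrq_el0" : "=r"(f));
    return f;
#else
    return 1000000000ull; /* the fallback counts nanoseconds */
#endif
}
```

A typical use is to call read_ticks() before and after many repetitions of the function under test and divide the difference by the repetition count, since a single call of a 40-instruction routine is likely shorter than one tick.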
NEON code:
ldp q0, q7, [x0]
ldp q6, q5, [x0, #32]
ldp q16, q4, [x0, #64]
ldp q3, q2, [x0, #96]
ldr q1, [x0, #128]
eor v4.16b, v4.16b, v0.16b
eor v1.16b, v1.16b, v5.16b
eor v3.16b, v3.16b, v7.16b
eor v2.16b, v2.16b, v6.16b
ldp q0, q7, [x0, #144]
ldp q6, q5, [x0, #176]
eor v0.16b, v0.16b, v16.16b
eor v4.16b, v4.16b, v7.16b
ldp q16, q7, [x0, #208]
eor v3.16b, v3.16b, v6.16b
eor v2.16b, v2.16b, v5.16b
ldp q6, q5, [x0, #240]
eor v1.16b, v1.16b, v16.16b
eor v0.16b, v0.16b, v7.16b
ldp q16, q7, [x0, #272]
eor v4.16b, v4.16b, v6.16b
eor v3.16b, v3.16b, v5.16b
ldp q6, q5, [x0, #304]
eor v2.16b, v2.16b, v16.16b
eor v1.16b, v1.16b, v7.16b
ldp q16, q7, [x0, #336]
eor v4.16b, v4.16b, v5.16b
eor v0.16b, v0.16b, v6.16b
ldp q6, q5, [x0, #368]
eor v3.16b, v3.16b, v16.16b
eor v2.16b, v2.16b, v7.16b
eor v1.16b, v1.16b, v6.16b
eor v0.16b, v0.16b, v5.16b
stp q4, q3, [x0]
stp q2, q1, [x0, #32]
str q0, [x0, #64]
ret
nop
nop
nop
C Code:
ldp x5, x4, [x0]
ldp x1, x6, [x0, #32]
ldp x7, x8, [x0, #48]
eor x5, x5, x6
ldp x3, x2, [x0, #16]
eor x4, x4, x7
ldp x6, x7, [x0, #64]
eor x3, x3, x8
ldp x11, x13, [x0, #120]
eor x2, x2, x6
eor x1, x1, x7
ldr x6, [x0, #88]
ldr x9, [x0, #96]
ldr x12, [x0, #136]
eor x6, x6, x13
ldr x10, [x0, #80]
eor x4, x4, x6
ldr x6, [x0, #160]
eor x9, x9, x12
eor x3, x3, x9
ldr x9, [x0, #168]
eor x10, x10, x11
eor x5, x5, x10
ldr x8, [x0, #104]
ldp x11, x10, [x0, #144]
eor x5, x6, x5
eor x4, x4, x9
ldr x7, [x0, #112]
eor x8, x8, x11
stp x5, x4, [x0]
eor x2, x2, x8
ldr x8, [x0, #176]
eor x7, x7, x10
ldr x6, [x0, #184]
eor x1, x1, x7
ldr x5, [x0, #192]
eor x3, x3, x8
eor x2, x2, x6
eor x1, x1, x5
stp x3, x2, [x0, #16]
str x1, [x0, #32]
ret
nop
nop
Best regards
Hi @RW-sec,
> ... is it documented anywhere how many clock cycles the instructions need when NEON registers are used?
The Cortex-A72 Software Optimization Guide has detailed throughput and latency numbers for the Advanced SIMD instructions.
Best regards,
Vincent.
Hi Vincent,
Thank you for your answer. This is exactly what I was looking for.
From the tables in the Cortex-A72 Software Optimization Guide, it seems that EOR (exclusive OR) has a latency of 1 on the basic integer ALU (AArch64, page 8) and a latency of 3 as an ASIMD instruction (AArch64, page 24). So the main difference between the two code examples above is only the EOR operation, or did I miss something? If so, it is clear why the NEON implementation is a little slower. The ASIMD load/store operations (also somewhat slow) were not used, were they?
Thank you for the document.
Considering EOR throughput, the Cortex-A72 should be able to perform up to 2 EOR operations per cycle, using either the 64-bit general-purpose (X) registers or the 128-bit V registers with Advanced SIMD. This means that with Advanced SIMD instructions you might theoretically be able to double the overall EOR throughput of your code.
For that, you will need to "hide" the additional latency. On an out-of-order core such as the A72, putting more independent instructions "in flight" would help.
In practice there might be other bottlenecks in your code than just the EOR. This is where profiling with the performance counters could help.
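As an illustration of what putting more instructions "in flight" can look like at the source level, here is a minimal sketch in portable scalar C. It is not the code from the question: the function names are hypothetical and the choice of four accumulators is arbitrary. The first function forms a single dependency chain, so each XOR has to wait for the previous result; the second keeps four independent chains, which gives an out-of-order core other work to issue while one result is still pending. The same shape should carry over to NEON by using uint8x16_t accumulators combined with veorq_u8.

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulator: every XOR depends on the one before it. */
static uint64_t xor_fold_single(const uint64_t *p, size_t n)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc ^= p[i];
    return acc;
}

/* Four accumulators: four independent dependency chains.
   The count of four is an assumption for illustration only. */
static uint64_t xor_fold_4way(const uint64_t *p, size_t n)
{
    uint64_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        a0 ^= p[i];
        a1 ^= p[i + 1];
        a2 ^= p[i + 2];
        a3 ^= p[i + 3];
    }
    /* Fold any leftover elements into the first accumulator. */
    for (; i < n; i++)
        a0 ^= p[i];

    /* Combine pairwise so the final reduction is a short tree. */
    return (a0 ^ a1) ^ (a2 ^ a3);
}
```

Both functions return the same value because XOR is associative and commutative; only the dependency structure differs. Note that the NEON listing in the question already keeps five accumulators (v0 to v4), so whether adding more chains helps there is something only a measurement can tell.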
OK, I see. Can I rely on GCC/Clang to schedule the instructions so that this additional latency is hidden, or should I do this manually?