NEON Intrinsics Performance

Hi ARM-Support,

I compared the performance of plain C code and C code with NEON intrinsics on my Raspberry Pi 4 and was surprised to find that the plain C code is slightly faster. In both cases the code consists of a series of XOR operations (200 bytes in the plain C code, 400 bytes in the NEON code). From the assembly generated by GCC, I expected the NEON code to be a little faster, as it has fewer instructions.
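For readers who want to reproduce the comparison, here is a minimal sketch of what the two variants might look like. The original C source is not part of this post, so everything here is an assumption: the names `xor_columns_scalar` and `xor_columns_neon`, the layout of 25 lanes folded into 5 columns (which is merely consistent with the load offsets in the listings below), and the `__ARM_NEON` guard that lets the file compile on non-ARM machines.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical layout: 25 lanes folded into 5 columns by XOR. */
#define LANES   25
#define COLUMNS 5

/* Plain C: each lane is one 64-bit word (25 * 8 = 200 bytes of input). */
void xor_columns_scalar(const uint64_t in[LANES], uint64_t out[COLUMNS])
{
    for (size_t x = 0; x < COLUMNS; x++) {
        uint64_t acc = in[x];
        for (size_t y = 1; y < LANES / COLUMNS; y++)
            acc ^= in[x + COLUMNS * y];
        out[x] = acc;
    }
}

#if defined(__ARM_NEON)
/* NEON: each lane is a 128-bit vector of two 64-bit words
 * (25 * 16 = 400 bytes of input), so one EOR handles two words. */
void xor_columns_neon(const uint64_t in[2 * LANES], uint64_t out[2 * COLUMNS])
{
    for (size_t x = 0; x < COLUMNS; x++) {
        uint64x2_t acc = vld1q_u64(&in[2 * x]);
        for (size_t y = 1; y < LANES / COLUMNS; y++)
            acc = veorq_u64(acc, vld1q_u64(&in[2 * (x + COLUMNS * y)]));
        vst1q_u64(&out[2 * x], acc);
    }
}
#endif
```

Note that in this sketch the NEON variant issues the same number of loads and XORs as the scalar one but moves twice as many bytes, so a fair comparison would be per byte processed rather than per call.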

The measurements were taken using CNTPCT_EL0, the Counter-timer Physical Count register.
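A minimal sketch of such a counter-based measurement helper is shown here, with several assumptions. It reads CNTVCT_EL0 (the virtual counter) instead of CNTPCT_EL0, because whether the physical counter is readable from user space depends on how the OS has configured it. The function names `read_counter`, `counter_frequency` and `ticks_to_ns` are hypothetical, and the non-AArch64 branch is only a fallback so the file compiles on other machines.

```c
#include <stdint.h>
#include <time.h>

/* Read a free-running counter.
 * AArch64: the generic timer's virtual count (assumption: used here in
 * place of CNTPCT_EL0). The ISB keeps the read from being hoisted above
 * earlier instructions.
 * Elsewhere: wall-clock nanoseconds, only so the sketch builds off-target. */
static inline uint64_t read_counter(void)
{
#if defined(__aarch64__)
    uint64_t v;
    __asm__ volatile("isb\n\tmrs %0, cntvct_el0" : "=r"(v) : : "memory");
    return v;
#else
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

/* Counter ticks per second, as reported by CNTFRQ_EL0 on AArch64. */
static inline uint64_t counter_frequency(void)
{
#if defined(__aarch64__)
    uint64_t f;
    __asm__ volatile("mrs %0, cntfrq_el0" : "=r"(f));
    return f;
#else
    return 1000000000ull; /* fallback counter is in nanoseconds */
#endif
}

/* Convert a tick difference to nanoseconds. */
static inline double ticks_to_ns(uint64_t ticks)
{
    return (double)ticks * 1e9 / (double)counter_frequency();
}
```

The generic timer counts at the rate given by CNTFRQ_EL0, not at the core clock, so one tick generally spans many CPU cycles. A sequence as short as the listings below would likely need to be repeated many times between two `read_counter()` calls for a small difference between the variants to be distinguishable from measurement noise.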

Do you have any idea why the NEON C code is slightly slower than the plain C code? Also, is it documented anywhere how many clock cycles the instructions take when NEON registers are used?

NEON code:

ldp q0, q7, [x0]
ldp q6, q5, [x0, #32]
ldp q16, q4, [x0, #64]
ldp q3, q2, [x0, #96]
ldr q1, [x0, #128]
eor v4.16b, v4.16b, v0.16b
eor v1.16b, v1.16b, v5.16b
eor v3.16b, v3.16b, v7.16b
eor v2.16b, v2.16b, v6.16b
ldp q0, q7, [x0, #144]
ldp q6, q5, [x0, #176]
eor v0.16b, v0.16b, v16.16b
eor v4.16b, v4.16b, v7.16b
ldp q16, q7, [x0, #208]
eor v3.16b, v3.16b, v6.16b
eor v2.16b, v2.16b, v5.16b
ldp q6, q5, [x0, #240]
eor v1.16b, v1.16b, v16.16b
eor v0.16b, v0.16b, v7.16b
ldp q16, q7, [x0, #272]
eor v4.16b, v4.16b, v6.16b
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

C code:

ldp x5, x4, [x0]
ldp x1, x6, [x0, #32]
ldp x7, x8, [x0, #48]
eor x5, x5, x6
ldp x3, x2, [x0, #16]
eor x4, x4, x7
ldp x6, x7, [x0, #64]
eor x3, x3, x8
ldp x11, x13, [x0, #120]
eor x2, x2, x6
eor x1, x1, x7
ldr x6, [x0, #88]
ldr x9, [x0, #96]
ldr x12, [x0, #136]
eor x6, x6, x13
ldr x10, [x0, #80]
eor x4, x4, x6
ldr x6, [x0, #160]
eor x9, x9, x12
eor x3, x3, x9
ldr x9, [x0, #168]
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Best regards
