NEON Intrinsics Performance

Hi ARM-Support,

I compared the performance of plain C code and C code with NEON intrinsics on my Raspberry Pi 4 and was surprised to find that the plain C code is slightly faster. In both cases the code performs some XOR operations (200 bytes in the plain C code, 400 bytes in the NEON code). Looking at the assembly generated by GCC, my impression was that the NEON code should be a little faster, as it has fewer instructions.
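To make the comparison concrete, here is a minimal sketch of the kind of code pair described above. It is not the original listing: the function names `xor_plain` and `xor_neon` and the buffer layout are assumptions chosen for illustration. The NEON variant is guarded by `__ARM_NEON` so the file still compiles on other targets.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Plain C: XOR src into dst one byte at a time.
 * (Hypothetical helper; depending on the optimization level,
 * GCC may auto-vectorize this loop into NEON instructions itself.) */
static void xor_plain(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] ^= src[i];
}

#if defined(__ARM_NEON)
/* NEON intrinsics: XOR 16 bytes per iteration, then finish
 * any remaining bytes with a scalar tail loop. */
static void xor_neon(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        uint8x16_t a = vld1q_u8(dst + i);   /* load 16 bytes of dst */
        uint8x16_t b = vld1q_u8(src + i);   /* load 16 bytes of src */
        vst1q_u8(dst + i, veorq_u8(a, b));  /* store a XOR b */
    }
    for (; i < n; i++)
        dst[i] ^= src[i];
}
#endif
```

Both functions should produce identical buffers. With buffers this small, the loads and stores are likely to dominate the work, so a lower instruction count in the NEON version does not necessarily translate into a shorter run time.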

The measurements were taken using CNTPCT_EL0, the Counter-timer Physical Count register.

Do you have any idea why the NEON code is slightly slower than the plain C code? And is it documented anywhere how many clock cycles the instructions take when NEON registers are used?

NEON code:

C code:

Best regards
