Cortex-A9 : NEON assembly code is not giving expected performance compared with ARM assembly code

Note: This was originally posted on 27th November 2012 at http://forums.arm.com

I have a problem: I have hand-written ARM assembly code and an equivalent NEON assembly version. I expected the NEON version to run about 4x faster than the ARM version, but I do not see that improvement.

Can you please explain what the reason could be?

I am using a Cortex-A9 processor, and the configuration in my Makefile is: "CFLAGS=--cpu=Cortex-A9 -O2 -Otime --apcs=/fpic --no_hide_all"

Please let me know whether I need to change any Makefile settings to get the NEON performance improvement.
  • Note: This was originally posted on 22nd March 2013 at http://forums.arm.com

    Oh OK, yes, that makes sense. Normally I would say that memory access is your main bottleneck, so speeding up the calculations does not help because the CPU is nearly always just waiting for data from memory. As I mentioned, this is a common problem on Cortex-A8, even worse on Cortex-A9, and will sometimes be a problem on Cortex-A15.

    But in your specific case (involving multiplies), if you look at the Cortex-A9 pipeline or its instruction timings, you will see that multiplication is performed only 32 bits at a time. This applies to both ARM code and NEON code, even when the instruction is a VMUL on 128-bit registers. For example, a 128-bit VMUL takes 4 times as many cycles as a 32-bit MUL, because the 128-bit VMUL is basically a macro instruction that performs 4 x 32-bit multiplies in sequence. This is quite different from other instructions such as addition, where a 128-bit VADD is typically the same speed as a 32-bit ADD rather than 4x slower.
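
To make the arithmetic above concrete, here is a rough back-of-the-envelope model in C. It is only a sketch: the per-instruction cycle costs are assumptions taken from the description in this reply (a 128-bit VMUL costing four times a 32-bit MUL, a 128-bit VADD costing the same as a 32-bit ADD), not measured values, and it ignores memory stalls, dual issue and latency. The names `scalar_cycles`, `neon_cycles` and `print_report` are made up for illustration.

```c
#include <stdio.h>

/* Hypothetical issue costs in cycles, following the description above.
   These are illustrative assumptions, not measured or documented figures. */
enum {
    SCALAR_OP_CYCLES = 1, /* one 32-bit MUL or ADD */
    VADD_Q_CYCLES    = 1, /* 128-bit VADD: same cost as a scalar ADD */
    VMUL_Q_CYCLES    = 4, /* 128-bit VMUL: four 32-bit multiplies in sequence */
    LANES_PER_Q      = 4  /* number of 32-bit lanes in a 128-bit register */
};

/* Cycles to process n_elems 32-bit elements one at a time. */
static unsigned long scalar_cycles(unsigned long n_elems)
{
    return n_elems * SCALAR_OP_CYCLES;
}

/* Cycles to process n_elems 32-bit elements four at a time, where each
   128-bit instruction costs cycles_per_q_op cycles. */
static unsigned long neon_cycles(unsigned long n_elems,
                                 unsigned long cycles_per_q_op)
{
    /* Round up: a partial group of lanes still needs a whole instruction. */
    unsigned long q_ops = (n_elems + LANES_PER_Q - 1) / LANES_PER_Q;
    return q_ops * cycles_per_q_op;
}

/* Print the modelled cycle counts for multiplies and additions. */
static void print_report(unsigned long n_elems)
{
    printf("elements:            %lu\n", n_elems);
    printf("scalar MUL or ADD:   %lu cycles\n", scalar_cycles(n_elems));
    printf("NEON VMUL (128-bit): %lu cycles\n",
           neon_cycles(n_elems, VMUL_Q_CYCLES));
    printf("NEON VADD (128-bit): %lu cycles\n",
           neon_cycles(n_elems, VADD_Q_CYCLES));
}
```

Calling `print_report(1024)` from a `main` function shows the point: under these assumptions the multiply loop needs the same number of cycles in NEON as in ARM code, while the add loop needs a quarter as many. A multiply-heavy kernel would therefore not be expected to show a 4x gain from NEON on this core, regardless of compiler flags.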