Cortex-A9: NEON assembly code is not giving the expected performance compared with ARM assembly code

Note: This was originally posted on 27th November 2012 at http://forums.arm.com

I have a problem: I have hand-written ARM assembly code and a NEON assembly version of the same code. I expected the NEON version to run about 4x faster than the ARM version, but I do not see that improvement.

Can you please explain what the reason could be?

I am using a Cortex-A9 processor, with this configuration in my Makefile: "CFLAGS=--cpu=Cortex-A9 -O2 -Otime --apcs=/fpic --no_hide_all"

Please let me know whether I need to change anything in the Makefile settings to get the NEON performance improvement.
  • Note: This was originally posted on 22nd March 2013 at http://forums.arm.com

    How do you know that code 1 uses only normal ARM instructions, and that code 2 uses NEON instructions and has 1/4 as many instructions? Did you write both versions in assembly or with NEON intrinsics to make sure, or are you just using plain C/C++ code?

    Also, it is very common on Cortex-A9 for memory to be your bottleneck, not CPU arithmetic. If you are completely "memory bound", then using NEON instead of ARM CPU code will often make no difference, and you are better off looking into other optimization possibilities such as cache preloading, and/or designing your code to make better use of the cache, and/or using the GPU to perform the operation (e.g. GLSL shaders for current hardware, or GPGPU acceleration if you are targeting future systems).
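
The cache preloading idea mentioned in the reply above can be sketched roughly as follows. This is only an illustration under several assumptions, not the original poster's code: it is written in C rather than assembly, it uses the GCC/Clang builtin `__builtin_prefetch` (the toolchain in the question appears to be a different compiler and would need its own equivalent), and the names `sum_plain`, `sum_prefetch` and `PREFETCH_AHEAD` are hypothetical. The prefetch distance is a guess that would have to be tuned by measurement on the real hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuning value: how many bytes ahead of the current
   position to request. The best distance must be found by measurement. */
#define PREFETCH_AHEAD 64

/* Cortex-A9 L1 data cache lines are 32 bytes long. */
#define CACHE_LINE 32

/* Baseline: sum the bytes of a buffer with no prefetch hints. */
uint32_t sum_plain(const uint8_t *src, size_t n)
{
    uint32_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += src[i];
    return total;
}

/* Same computation, but issuing one prefetch hint per cache line so the
   data may already be in cache by the time the loop reaches it. */
uint32_t sum_prefetch(const uint8_t *src, size_t n)
{
    assert(src != NULL || n == 0);
    uint32_t total = 0;
    for (size_t i = 0; i < n; i++) {
        if ((i % CACHE_LINE) == 0 && i + PREFETCH_AHEAD < n) {
            /* Arguments: address, 0 = read access, 0 = low temporal
               locality (streaming data that is not reused). */
            __builtin_prefetch(&src[i + PREFETCH_AHEAD], 0, 0);
        }
        total += src[i];
    }
    return total;
}
```

Both functions return the same result; only the memory access hints differ. Timing the two on a buffer much larger than the cache is one way to check whether a loop is memory bound: if the prefetching version is noticeably faster, memory latency rather than arithmetic is likely the limit, and switching the arithmetic to NEON alone would not be expected to help much.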