
Cortex-A9: NEON assembly code is not giving the expected performance compared with ARM assembly code

Note: This was originally posted on 27th November 2012 at http://forums.arm.com

I have a problem: I have hand-written ARM9 assembly code and NEON assembly code. I expected the NEON assembly code to be about 4x faster than the ARM assembly code, but I do not see that improvement.

Can you please explain what the reason could be?

I am using a Cortex-A9 processor, and the configuration in my Makefile is: "CFLAGS=--cpu=Cortex-A9 -O2 -Otime --apcs=/fpic --no_hide_all"

Please let me know whether I need to change any Makefile settings to get the NEON performance improvement.
  • Note: This was originally posted on 27th November 2012 at http://forums.arm.com

    Thanks for your reply,

    Both the ARM assembly code and the NEON assembly code implement the same functionality.

    As you know, I am getting a 70% improvement with the NEON assembly code compared with the pure fixed-point C
    code for this algorithm, and a 40% improvement with the ARM9 assembly code.

    In other words, the difference between the NEON assembly code and the ARM assembly code is only 30-35%.
    This is my issue: why is there only a 30% improvement with NEON assembly over ARM assembly?

    I am also aware that the Cortex-A9 has an out-of-order execution feature, and that this feature only helps ARM
    instructions. Because of this feature the ARM assembly code performs better on the Cortex-A9,
    which is the reason I see a smaller performance difference between the NEON and ARM assembly code.

    Can you explain this in detail?

    As you know, I wrote both the ARM and NEON assembly code to understand the difference between the two units.
    For NEON I expected a 4x improvement over ARM.
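
The 70% and 40% figures above can be read in two ways, and the implied NEON-over-ARM gain differs between them. The post does not say which reading is meant, so the following is only a minimal sketch under stated assumptions: the function names are hypothetical, and the two interpretations (run time saved versus extra throughput, both relative to the C code) are assumptions, not something the poster confirmed.

```python
def speedup_if_time_reduction(neon_pct, arm_pct):
    """Assumed reading 1: 'improvement' is the percentage of the C run time saved.

    Returns how many times faster the NEON code is than the ARM code.
    """
    neon_time = 1.0 - neon_pct / 100.0  # NEON run time as a fraction of the C run time
    arm_time = 1.0 - arm_pct / 100.0    # ARM run time as a fraction of the C run time
    return arm_time / neon_time


def speedup_if_throughput_gain(neon_pct, arm_pct):
    """Assumed reading 2: 'improvement' is extra throughput relative to the C code.

    Returns how many times faster the NEON code is than the ARM code.
    """
    neon_rate = 1.0 + neon_pct / 100.0  # NEON throughput relative to C
    arm_rate = 1.0 + arm_pct / 100.0    # ARM throughput relative to C
    return neon_rate / arm_rate


if __name__ == "__main__":
    # Figures quoted in the post: 70% for NEON, 40% for ARM assembly.
    print(f"time-reduction reading:  {speedup_if_time_reduction(70, 40):.2f}x")
    print(f"throughput-gain reading: {speedup_if_throughput_gain(70, 40):.2f}x")
```

Under the first reading the NEON code is roughly 2x faster than the ARM code; under the second it is only about 1.2x faster. Neither reading gives 4x, and simply subtracting the two percentages matches neither ratio, so it helps to state measured cycle counts or run times when comparing the two versions.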