Cortex-A9 : NEON assembly code is not giving expected performance compared with ARM assembly code

Note: This was originally posted on 27th November 2012 at http://forums.arm.com

I have hand-written ARM9 assembly code and a NEON assembly version of the same code. I expected the NEON version to run about 4x faster than the ARM assembly code, but I do not see that improvement.

Can you please explain what the reason could be?

I am using a Cortex-A9 processor, and the configuration in my Makefile is: "CFLAGS=--cpu=Cortex-A9 -O2 -Otime --apcs=/fpic --no_hide_all"

Please let me know whether I need to change anything in the Makefile settings to get the NEON performance improvement.
  • Note: This was originally posted on 22nd March 2013 at http://forums.arm.com

    Hi,

    I ran a NEON operation test on a Linux platform board.

    (1) Matrix multiplication, calculating one element at a time. Here I used only S registers. (Normal ARM instructions)

    (2) Matrix multiplication using 128-bit operations, so the number of instructions becomes 1/4 of that in (1). Here I used Q and D registers. (NEON instructions)

    I am using Linux 3.0.35, and the test code is executed on a Linux platform (Cortex-A9 architecture).
    But there is no speed difference between (1) and (2).


    In my Linux kernel configuration, the following options are enabled:
    CONFIG_VFP=y
    CONFIG_VFPv3=y
    CONFIG_NEON=y

    I used the following gcc command to build the NEON application; the compiler version is gcc 4.6.2:
    gcc  -march=armv7-a -mtune=cortex-a9 -mfpu=neon -ftree-vectorize -ffast-math -mfloat-abi=hard  -o test.out test.c

    Do any other settings need to be changed to enable NEON?
    Why did I not find any performance difference between the normal ARM code and the NEON code?
    Let me know if I missed anything.

    Thanks in advance
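
For reference, the two variants compared in the post above (one element at a time versus 128-bit NEON operations) might look roughly like the sketch below. This is only an illustration under several assumptions that are not in the original posts: a fixed 4x4 single-precision matrix in column-major order, NEON intrinsics instead of hand-written assembly, and the hypothetical function names `matmul4_scalar` and `matmul4_simd`. When NEON is not available at compile time, the second function simply falls back to the scalar loop.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Variant (1): C = A * B for 4x4 column-major float matrices,
   computing one output element at a time.
   Element (i, j) of a matrix m is stored at m[4 * j + i]. */
void matmul4_scalar(const float *a, const float *b, float *c)
{
    int i, j, k;
    for (j = 0; j < 4; j++) {
        for (i = 0; i < 4; i++) {
            float sum = 0.0f;
            for (k = 0; k < 4; k++)
                sum += a[4 * k + i] * b[4 * j + k];
            c[4 * j + i] = sum;
        }
    }
}

/* Variant (2): same product, but each column of C is built with
   128-bit operations (four floats per instruction) when NEON is
   available. Without NEON this calls the scalar version. */
void matmul4_simd(const float *a, const float *b, float *c)
{
#if HAVE_NEON
    int j;
    /* Load the four columns of A into Q registers once. */
    float32x4_t a0 = vld1q_f32(a + 0);
    float32x4_t a1 = vld1q_f32(a + 4);
    float32x4_t a2 = vld1q_f32(a + 8);
    float32x4_t a3 = vld1q_f32(a + 12);

    for (j = 0; j < 4; j++) {
        const float *bj = b + 4 * j;
        /* Column j of C = sum over k of (column k of A) * B(k, j). */
        float32x4_t col = vmulq_n_f32(a0, bj[0]);
        col = vmlaq_n_f32(col, a1, bj[1]);
        col = vmlaq_n_f32(col, a2, bj[2]);
        col = vmlaq_n_f32(col, a3, bj[3]);
        vst1q_f32(c + 4 * j, col);
    }
#else
    matmul4_scalar(a, b, c);
#endif
}
```

On an ARMv7-A target, building with `-mfpu=neon` as in the gcc command above should select the intrinsic path; on other targets the scalar fallback is compiled, so both functions produce the same result and can be timed against each other in the same test program.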