
Cortex-A9 : NEON assembly code is not giving expected performance compared with ARM assembly code

Note: This was originally posted on 27th November 2012 at http://forums.arm.com

I have a problem: I have hand-written ARM assembly code and a NEON assembly version of the same code. I expected the NEON assembly to run about 4x faster than the ARM assembly, but I do not see that improvement.

Can you please explain what the reason could be?

I am using a Cortex-A9 processor with the following configuration in my Makefile: "CFLAGS=--cpu=Cortex-A9 -O2 -Otime --apcs=/fpic --no_hide_all"

Please let me know whether I need to change anything in the Makefile settings to get the NEON performance improvement.
  • Note: This was originally posted on 29th November 2012 at http://forums.arm.com

    As you know, I cannot share the code here, but I have run another experiment; the details are below.

    I created a few test cases to understand the time/cycle cost of ARM versus NEON code on the Cortex-A9 processor.

    Project 1 -> Two functions, each performing 1000 million additions.

    Function-1: 1000 million additions using ARM instructions, "loc_add_ARM".

    Function-2: 1000 million additions using NEON instructions, "loc_add_NEON".

    The time ticks for these two functions are listed in the table below. I used the gettimeofday() function to take the timings on our Cortex-A9 target.

    Function name                        Time (ticks)
    loc_add_ARM                          895230
    loc_add_NEON                         380375

    Project 2 -> In this case I enabled only Function-1 (1000 million additions using ARM instructions).

    The time ticks for this case:

    Function name                        Time (ticks)
    loc_add_ARM                          800792
    loc_add_NEON                         0 (not enabled; not called from the main function)



    Project 3 -> In this case I added one NEON instruction to Function-1 (1000 million additions using ARM instructions).

    The time ticks for this case:

    Function name                        Time (ticks)
    loc_add_ARM + 1 NEON instruction     895235
    loc_add_NEON                         0 (not enabled; not called from the main function)
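
    For reference, here is a minimal sketch of how such a measurement can be taken with gettimeofday(). The real loc_add_ARM and loc_add_NEON routines are assembly and cannot be shared, so add_loop_scalar below is a hypothetical plain C stand-in, and time_call_us is a helper name invented for this sketch. It reports microseconds, which is an assumption; the unit of the tick values above is not stated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/* Hypothetical stand-in for loc_add_ARM: the real routine is hand-written
 * assembly that was not shared, so this plain C loop is illustration only. */
uint32_t add_loop_scalar(uint32_t iterations)
{
    /* volatile stops the compiler from collapsing the loop into one add. */
    volatile uint32_t acc = 0;
    for (uint32_t i = 0; i < iterations; ++i) {
        acc += 1u;
    }
    return acc;
}

/* Wall-clock duration of one call to fn(iterations), in microseconds.
 * The value returned by fn is stored through result when it is non-NULL,
 * so the measured work is actually used by the caller. */
int64_t time_call_us(uint32_t (*fn)(uint32_t), uint32_t iterations,
                     uint32_t *result)
{
    struct timeval start, end;
    assert(fn != NULL);

    gettimeofday(&start, NULL);
    uint32_t r = fn(iterations);
    gettimeofday(&end, NULL);

    if (result != NULL) {
        *result = r;
    }
    return (int64_t)(end.tv_sec - start.tv_sec) * 1000000
         + (int64_t)(end.tv_usec - start.tv_usec);
}
```

    gettimeofday() measures wall-clock time, so anything else running on the target during the loop is included in the figure. Repeating each measurement several times and comparing the runs is one way to see how much of a difference is noise.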




    My question now: why is there such a large time/cycle difference for the function "loc_add_ARM" across these three cases?
    Is it something related to the pipeline?
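
    To put the figures in perspective, the ratios can be worked out directly from the tick values. A small sketch follows; the helper names speedup and percent_slower are invented for this illustration and are not part of the original code.

```c
#include <assert.h>

/* How many times faster the quicker run is: ratio of the two times. */
double speedup(double slow_time, double fast_time)
{
    assert(fast_time > 0.0);
    return slow_time / fast_time;
}

/* Percentage by which `measured` exceeds `baseline`. */
double percent_slower(double baseline, double measured)
{
    assert(baseline > 0.0);
    return (measured - baseline) / baseline * 100.0;
}
```

    With the Project 1 numbers, speedup(895230, 380375) is roughly 2.35, well short of the 4x I expected. Comparing Project 2 with Project 3, percent_slower(800792, 895235) is roughly 11.8, so loc_add_ARM takes about 12% longer once a single NEON instruction is present.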

    Thanks,

    mj