
Cortex-A9: NEON assembly code is not giving the expected performance compared with ARM assembly code

Note: This was originally posted on 27th November 2012 at http://forums.arm.com

I have a problem: I have hand-written ARM9 assembly code and NEON assembly code. I expected the NEON assembly to run about 4x faster than the ARM assembly, but I do not see that improvement in the NEON code.

Can you please explain what the reason could be?

I am using a Cortex-A9 processor, with this configuration in my Makefile: "CFLAGS=--cpu=Cortex-A9 -O2 -Otime --apcs=/fpic --no_hide_all"

Please let me know whether I need to change any Makefile settings to get the NEON performance improvement.
  • Note: This was originally posted on 1st December 2012 at http://forums.arm.com

    Thanks, shervin, for your reply.

    I don't understand why NEON behaves like a coprocessor.

    In that case, I may have a few instructions from the ARM instruction set inside a NEON loop that runs billions of times, for example to handle the loop count or index modifications.

    There would be a lot of delay from switching between ARM and NEON.

    So I don't think it is correct that NEON behaves as a coprocessor.

    Please see the loop below:

    =============================================================
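    ; r3 = loop count (a large value)
    ; r0 = source address, r2 = destination address
    FORLOOP
    VLD1.16      {d0,d1,d2,d3}, [r0]!   ; load 32 bytes; '!' advances r0 by the 32 bytes transferred
    VQDMULL.S16  q4, d0, d1             ; q4 = saturate(2 * d0 * d1), 4 x 32-bit lanes
    VQDMULL.S16  q5, d2, d3             ; q5 = saturate(2 * d2 * d3), 4 x 32-bit lanes
    VST1.32      {q4,q5}, [r2]!         ; store 32 bytes; '!' advances r2 by the 32 bytes transferred
    SUBS         r3, r3, #32            ; ARM instruction: update the loop count
    BGT          FORLOOP                ; ARM instruction: loop while the count is positive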
    ================================================
    So in this loop there are ARM instructions (SUBS and BGT) after the NEON instructions. It would therefore incur the delay you mentioned, caused by NEON behaving as a coprocessor.
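For checking the NEON output against plain scalar code, here is a minimal C sketch of what one pass of the loop above computes. It assumes VLD1.16 fills d0..d3 with 16 consecutive halfwords and that VST1.32 writes q4 followed by q5. The names `qdmull_s16` and `qdmull_blocks` and the `blocks` parameter are hypothetical and not taken from the original code; how r3 maps to a block count is also an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one VQDMULL.S16 lane: saturate(2 * a * b) to 32 bits.
 * Only a == b == -32768 overflows, and it saturates to INT32_MAX. */
static int32_t qdmull_s16(int16_t a, int16_t b)
{
    int64_t p = 2 * (int64_t)a * (int64_t)b;
    return p > INT32_MAX ? INT32_MAX : (int32_t)p;
}

/* Hypothetical reference for the loop body: each block reads 16 halfwords
 * (32 bytes) from src and writes 8 words (32 bytes) to dst. */
void qdmull_blocks(const int16_t *src, int32_t *dst, size_t blocks)
{
    assert(src != NULL && dst != NULL);
    for (size_t n = 0; n < blocks; ++n) {
        const int16_t *d0 = src;       /* halfwords 0..3   */
        const int16_t *d1 = src + 4;   /* halfwords 4..7   */
        const int16_t *d2 = src + 8;   /* halfwords 8..11  */
        const int16_t *d3 = src + 12;  /* halfwords 12..15 */
        for (int i = 0; i < 4; ++i) {
            dst[i]     = qdmull_s16(d0[i], d1[i]);  /* q4 lanes */
            dst[4 + i] = qdmull_s16(d2[i], d3[i]);  /* q5 lanes */
        }
        src += 16;  /* matches the 32-byte advance of r0 */
        dst += 8;   /* matches the 32-byte advance of r2 */
    }
}
```

Each block performs 8 multiplies with two NEON multiply instructions, and the only ARM instructions per iteration are the counter update and the branch.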


    Thanks,
    MJ