Cortex-A9 : NEON assembly code is not giving expected performance compared with ARM assembly code

Note: This was originally posted on 27th November 2012 at http://forums.arm.com

I have a problem: I have hand-written ARM assembly code and hand-written NEON assembly code. I expected the NEON version to run about 4x faster than the ARM version, but I do not see that improvement.

Can you please explain what the reason could be?

I am using a Cortex-A9 processor, with this configuration in my Makefile: "CFLAGS=--cpu=Cortex-A9 -O2 -Otime --apcs=/fpic --no_hide_all"

Please let me know whether I need to change anything in the Makefile settings to get the NEON performance improvement.
  • Note: This was originally posted on 30th November 2012 at http://forums.arm.com

    Hi Servin,
    Please see my question below.

    Right now I am not concerned with NEON assembly code versus ARM assembly code.

    My current issue, put simply, is this:

    I have an assembly routine written using only ARM instructions. The routine just performs 1000 million additions.

    Please see the code below:

    res = loc_add_ARM(1000000000);

            ARM
            REQUIRE8
            PRESERVE8

            AREA ||.text||, CODE, READONLY, ALIGN=2
            GLOBAL   loc_add_ARM

    loc_add_ARM
            PUSH     {r4,r5,lr}
            MOV      r5,#1              ; val added on every ADD
            MOV      r1,#0              ; four independent accumulators
            MOV      r2,#0
            MOV      r3,#0
            MOV      r4,#0
            MOV      r0,r0,ASR #2       ; loop counter = count / 4

    loc_add_ARM_LOOP                    ; 16 additions per iteration
            ADD      r1,r1,r5
            ADD      r2,r2,r5
            ADD      r3,r3,r5
            ADD      r4,r4,r5
            ADD      r1,r1,r5
            ADD      r2,r2,r5
            ADD      r3,r3,r5
            ADD      r4,r4,r5
            ADD      r1,r1,r5
            ADD      r2,r2,r5
            ADD      r3,r3,r5
            ADD      r4,r4,r5
            ADD      r1,r1,r5
            ADD      r2,r2,r5
            ADD      r3,r3,r5
            ADD      r4,r4,r5
            SUBS     r0,r0,#4           ; counter -= 4
            BGT      loc_add_ARM_LOOP   ; repeat while counter > 0

            ADD      r0,r1,r2           ; combine the four accumulators
            ADD      r1,r3,r4
            ADD      r0,r0,r1           ; res -> r0
            POP      {r4,r5,pc}

            END

    =============================================================

    Completing this operation takes time "800792".
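For readers who want to check the arithmetic of the loop, here is a minimal C sketch of what the routine above computes. The name `loc_add_model` is hypothetical and not part of the original post. The sketch assumes 32-bit signed arithmetic and that `>>` on a signed value is an arithmetic shift, as it is with the usual ARM compilers. The counter starts at count / 4 and drops by 4 per iteration, so the 16 unrolled additions run count / 16 times, which gives count additions in total when count is a multiple of 16.

```c
#include <stdint.h>

/* Hypothetical C model of loc_add_ARM, for checking the arithmetic only.
 * It mirrors the register usage of the assembly; it is not a benchmark. */
static int32_t loc_add_model(int32_t count)
{
    const int32_t val = 1;                  /* r5 */
    int32_t r1 = 0, r2 = 0, r3 = 0, r4 = 0; /* four accumulators */
    int32_t counter = count >> 2;           /* MOV r0,r0,ASR #2 */

    do {
        /* 4 rounds x 4 accumulators = the 16 unrolled ADDs */
        for (int i = 0; i < 4; i++) {
            r1 += val;
            r2 += val;
            r3 += val;
            r4 += val;
        }
        counter -= 4;                       /* SUBS r0,r0,#4 */
    } while (counter > 0);                  /* BGT loc_add_ARM_LOOP */

    return (r1 + r2) + (r3 + r4);           /* final three ADDs */
}
```

Because the branch is tested at the bottom of the loop, the body always runs at least once, and the number of additions is effectively rounded up to a multiple of 16. For the call in the post, 1000000000 is a multiple of 16, so the model performs exactly 1000 million additions.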

    For my next experiment, I used the same ARM assembly code and added just one extra NEON instruction:

    res = loc_add_ARM(1000000000);

            ARM
            REQUIRE8
            PRESERVE8

            AREA ||.text||, CODE, READONLY, ALIGN=2
            GLOBAL   loc_add_ARM

    loc_add_ARM
            PUSH     {r4,r5,lr}
            VEOR     q0,q0,q0           ; the one extra NEON instruction
            MOV      r5,#1              ; val added on every ADD
            MOV      r1,#0              ; four independent accumulators
            MOV      r2,#0
            MOV      r3,#0
            MOV      r4,#0
            MOV      r0,r0,ASR #2       ; loop counter = count / 4

    loc_add_ARM_LOOP                    ; 16 additions per iteration
            ADD      r1,r1,r5
            ADD      r2,r2,r5
            ADD      r3,r3,r5
            ADD      r4,r4,r5
            ADD      r1,r1,r5
            ADD      r2,r2,r5
            ADD      r3,r3,r5
            ADD      r4,r4,r5
            ADD      r1,r1,r5
            ADD      r2,r2,r5
            ADD      r3,r3,r5
            ADD      r4,r4,r5
            ADD      r1,r1,r5
            ADD      r2,r2,r5
            ADD      r3,r3,r5
            ADD      r4,r4,r5
            SUBS     r0,r0,#4           ; counter -= 4
            BGT      loc_add_ARM_LOOP   ; repeat while counter > 0

            ADD      r0,r1,r2           ; combine the four accumulators
            ADD      r1,r3,r4
            ADD      r0,r0,r1           ; res -> r0
            POP      {r4,r5,pc}

            END

    =============================================================

    But this version gives a time of "895230".

    Why does adding one NEON instruction increase the time?
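Before looking for a cause, it can help to confirm that a difference of this size is larger than run-to-run noise. The following is a minimal sketch of a repeat-and-take-the-fastest timing helper in C. It assumes a POSIX system where `clock_gettime(CLOCK_MONOTONIC, ...)` is available; the post does not say how the times were measured or on which OS, so this is an assumption. The names `best_time_ns`, `bench_fn` and `count_up` are hypothetical, and `count_up` is only a stand-in for the routine under test.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime under strict C modes */
#endif
#include <stdint.h>
#include <time.h>

/* Signature shared by the routines being compared (assumed). */
typedef int32_t (*bench_fn)(int32_t);

/* Hypothetical stand-in workload; replace with the real routine. */
static int32_t count_up(int32_t n)
{
    int32_t acc = 0;
    for (int32_t i = 0; i < n; i++)
        acc += 1;
    return acc;
}

/* Nanoseconds elapsed between two CLOCK_MONOTONIC readings. */
static int64_t elapsed_ns(struct timespec start, struct timespec end)
{
    return (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
         + (int64_t)(end.tv_nsec - start.tv_nsec);
}

/* Calls fn(arg) `repeats` times and returns the fastest run in ns.
 * The last return value is stored through *result so the call is
 * not optimized away. */
static int64_t best_time_ns(bench_fn fn, int32_t arg, int repeats,
                            volatile int32_t *result)
{
    int64_t best = INT64_MAX;
    for (int i = 0; i < repeats; i++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        *result = fn(arg);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        int64_t dt = elapsed_ns(t0, t1);
        if (dt < best)
            best = dt;
    }
    return best;
}
```

Calling `best_time_ns` once for each variant of the routine, with the same argument and several repeats, and comparing the two minima shows whether the gap persists across runs. Taking the minimum rather than the mean is a design choice that reduces the influence of interrupts and other one-off disturbances.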

    Could you please help with this?

    Thanks,

    MJ


