Cortex-A9 : NEON assembly code is not giving expected performance compared with ARM assembly code

Note: This was originally posted on 27th November 2012 at http://forums.arm.com

I have a problem: I have hand-written ARM assembly code and a NEON assembly version of the same routine. I expected the NEON version to run about 4x faster than the ARM version, but I do not see that improvement.

Can you please explain what the reason could be?

I am using a Cortex-A9 processor, and my Makefile is configured with: "CFLAGS=--cpu=Cortex-A9 -O2 -Otime --apcs=/fpic --no_hide_all"

Please let me know whether I need to change anything in the Makefile settings to get the NEON performance improvement.


Mohamed Jauhar over 12 years ago

Note: This was originally posted on 30th November 2012 at http://forums.arm.com

Hi Servin,
Please see my question below.

For now I am not concerned with NEON assembly versus ARM assembly performance.

My issue, put simply, is this:

I have an assembly routine written using only ARM instructions. The algorithm just performs 1000 million additions.

Please see the code below:

res = loc_add_ARM(1000000000);

        ARM
        REQUIRE8
        PRESERVE8
        AREA ||.text||, CODE, READONLY, ALIGN=2
        GLOBAL   loc_add_ARM

loc_add_ARM
        PUSH     {r4,r5,lr}
        MOV      r5,#1           ; val
        MOV      r1,#0           ; four independent accumulators
        MOV      r2,#0
        MOV      r3,#0
        MOV      r4,#0
        MOV      r0,r0,ASR #2    ; r0 = count / 4

loc_add_ARM_LOOP                 ; 16 ADDs per pass
        ADD      r1,r1,r5
        ADD      r2,r2,r5
        ADD      r3,r3,r5
        ADD      r4,r4,r5
        ADD      r1,r1,r5
        ADD      r2,r2,r5
        ADD      r3,r3,r5
        ADD      r4,r4,r5
        ADD      r1,r1,r5
        ADD      r2,r2,r5
        ADD      r3,r3,r5
        ADD      r4,r4,r5
        ADD      r1,r1,r5
        ADD      r2,r2,r5
        ADD      r3,r3,r5
        ADD      r4,r4,r5
        SUBS     r0,r0,#4        ; count/4 minus 4 per pass
        BGT      loc_add_ARM_LOOP

        ADD      r0,r1,r2        ; sum the four accumulators
        ADD      r1,r3,r4
        ADD      r0,r0,r1
        ; res -> r0
        POP      {r4,r5,pc}

        END
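
As a cross-check on the loop arithmetic, here is a rough C model of the routine above. It is only a sketch: the names `loc_add_model` and `loc_add_trace` are made up for illustration and are not part of the posted code, and it assumes a positive `count`. It shows that the count is shifted right by 2 and then decremented by 4 per pass, so each pass of 16 ADDs accounts for 16 of the requested additions.

```c
#include <stdint.h>

/* Hypothetical C model of the posted loop (illustration only). */
typedef struct {
    int32_t result;      /* value the routine would return in r0 */
    int32_t iterations;  /* number of passes through the loop body */
} loc_add_trace;

static loc_add_trace loc_add_model(int32_t count)
{
    int32_t r1 = 0, r2 = 0, r3 = 0, r4 = 0;
    const int32_t r5 = 1;                 /* MOV r5,#1 */
    loc_add_trace t = {0, 0};

    int32_t r0 = count >> 2;              /* MOV r0,r0,ASR #2 (count assumed positive) */
    do {
        /* 4 rounds of 4 ADDs = the 16 ADDs in the loop body */
        for (int i = 0; i < 4; i++) {
            r1 += r5;
            r2 += r5;
            r3 += r5;
            r4 += r5;
        }
        r0 -= 4;                          /* SUBS r0,r0,#4 */
        t.iterations++;
    } while (r0 > 0);                     /* BGT loc_add_ARM_LOOP */

    t.result = (r1 + r2) + (r3 + r4);     /* final three ADDs */
    return t;
}
```

Under this model, a count of 1000000000 corresponds to 62,500,000 passes and a result of 1000000000. The result equals the count only when the count is a multiple of 16, which 1000000000 is.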

=============================================================

This operation takes time "800792" to complete.

For my next experiment, I used the same ARM assembly code and added just one extra NEON instruction:

res = loc_add_ARM(1000000000);

        ARM
        REQUIRE8
        PRESERVE8
        AREA ||.text||, CODE, READONLY, ALIGN=2
        GLOBAL   loc_add_ARM

loc_add_ARM
        PUSH     {r4,r5,lr}
        VEOR     q0,q0,q0        ; the only added instruction (NEON): clears q0
        MOV      r5,#1           ; val
        MOV      r1,#0           ; four independent accumulators
        MOV      r2,#0
        MOV      r3,#0
        MOV      r4,#0
        MOV      r0,r0,ASR #2    ; r0 = count / 4

loc_add_ARM_LOOP                 ; 16 ADDs per pass
        ADD      r1,r1,r5
        ADD      r2,r2,r5
        ADD      r3,r3,r5
        ADD      r4,r4,r5
        ADD      r1,r1,r5
        ADD      r2,r2,r5
        ADD      r3,r3,r5
        ADD      r4,r4,r5
        ADD      r1,r1,r5
        ADD      r2,r2,r5
        ADD      r3,r3,r5
        ADD      r4,r4,r5
        ADD      r1,r1,r5
        ADD      r2,r2,r5
        ADD      r3,r3,r5
        ADD      r4,r4,r5
        SUBS     r0,r0,#4        ; count/4 minus 4 per pass
        BGT      loc_add_ARM_LOOP

        ADD      r0,r1,r2        ; sum the four accumulators
        ADD      r1,r3,r4
        ADD      r0,r0,r1
        ; res -> r0
        POP      {r4,r5,pc}

        END

=============================================================

But this version gives a time of "895230".

Why does adding a single NEON instruction increase the time?
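
To put the two quoted timings on a common scale, the relative increase can be worked out as below. This is a minimal sketch: `percent_increase` is a hypothetical helper name, and the units of the timings are not stated in the post, so only the ratio is meaningful.

```c
/* Relative change from `before` to `after`, in percent.
   Hypothetical helper, for illustration only. */
static double percent_increase(double before, double after)
{
    return (after - before) / before * 100.0;
}
```

Calling `percent_increase(800792.0, 895230.0)` gives an increase of roughly 11.8%, that is, 94438 extra time units on top of 800792.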

Could you please help with this?

Thanks,

MJ