Cortex-A9 : NEON assembly code is not giving expected performance compared with ARM assembly code
Mohamed Jauhar
over 12 years ago
Note: This was originally posted on 27th November 2012 at http://forums.arm.com
I have a problem: I have hand-written ARM assembly code and hand-written NEON assembly code for the same task. I expected the NEON version to run roughly 4x faster than the ARM version, but I do not see that improvement.
Can you please explain what the reason could be?
I am using a Cortex-A9 processor, and my Makefile contains: "CFLAGS=--cpu=Cortex-A9 -O2 -Otime --apcs=/fpic --no_hide_all"
Please let me know whether I need to change any Makefile settings to get the NEON performance improvement.
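The 4x expectation presumably comes from lane count: a 128-bit NEON Q register holds four 32-bit elements, so one vector add can replace four scalar adds. The sketch below only models that instruction-count arithmetic; the function names are hypothetical, and the ratio is an upper bound on instruction savings, not a prediction of running time.

```python
# Illustrative model only: counts add instructions, not cycles.
LANES_PER_Q_REGISTER = 128 // 32  # four 32-bit lanes in one 128-bit Q register


def scalar_add_count(n_elements):
    """One scalar ADD per element."""
    return n_elements


def vector_add_count(n_elements, lanes=LANES_PER_Q_REGISTER):
    """One vector add per group of `lanes` elements, rounded up."""
    return -(-n_elements // lanes)


def instruction_ratio(n_elements):
    """How many times fewer add instructions the vector form needs."""
    return scalar_add_count(n_elements) / vector_add_count(n_elements)


if __name__ == "__main__":
    print(instruction_ratio(1000))
```

The measured speedup can be lower than this ratio, because loads, stores, loop overhead and pipeline behaviour are not reduced by the same factor.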
Mohamed Jauhar
over 12 years ago
Note: This was originally posted on 30th November 2012 at http://forums.arm.com
Hi Servin,
Please see my question below.
Right now I am not concerned with NEON assembly code versus ARM assembly code.
My issue, put simply, is this:
I have an assembly routine that I wrote using only ARM instructions. The algorithm just performs 1000 million additions.
Please see the code below:
res = loc_add_ARM(1000000000);

        ARM
        REQUIRE8
        PRESERVE8
        AREA    ||.text||, CODE, READONLY, ALIGN=2
        GLOBAL  loc_add_ARM
loc_add_ARM
        PUSH    {r4,r5,lr}
        MOV     r5,#1           ; val
        MOV     r1,#0           ; four independent accumulators
        MOV     r2,#0
        MOV     r3,#0
        MOV     r4,#0
        MOV     r0,r0, ASR #2   ; loop counter = n / 4
loc_add_ARM_LOOP
        ADD     r1,r1,r5
        ADD     r2,r2,r5
        ADD     r3,r3,r5
        ADD     r4,r4,r5
        ADD     r1,r1,r5
        ADD     r2,r2,r5
        ADD     r3,r3,r5
        ADD     r4,r4,r5
        ADD     r1,r1,r5
        ADD     r2,r2,r5
        ADD     r3,r3,r5
        ADD     r4,r4,r5
        ADD     r1,r1,r5
        ADD     r2,r2,r5
        ADD     r3,r3,r5
        ADD     r4,r4,r5
        SUBS    r0,r0,#4        ; 16 ADDs per iteration, counter drops by 4
        BGT     loc_add_ARM_LOOP
        ADD     r0,r1,r2        ; combine the accumulators
        ADD     r1,r3,r4
        ADD     r0,r0,r1
        ; res -> r0
        POP     {r4,r5,pc}
        END
=============================================================
Completing this operation takes a measured time of "800792".
For my next experiment, I used the same ARM assembly code but added just one extra NEON instruction:
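As a cross-check on the listing, the loop's arithmetic can be modelled in a few lines of Python. This is only a sketch of the register arithmetic, with hypothetical function names, assuming a non-negative argument and 32-bit wraparound; it says nothing about timing.

```python
MASK32 = 0xFFFFFFFF  # registers are 32 bits wide


def loc_add_model(n):
    """Step-by-step model of loc_add_ARM.

    Returns (result, add_count, iterations). Assumes n >= 0.
    """
    counter = n >> 2            # MOV r0,r0, ASR #2
    acc = [0, 0, 0, 0]          # r1..r4
    adds = 0
    iterations = 0
    while True:                 # the loop body runs before the test
        for _ in range(4):      # four groups of four ADDs
            for i in range(4):
                acc[i] = (acc[i] + 1) & MASK32
                adds += 1
        iterations += 1
        counter -= 4            # SUBS r0,r0,#4
        if counter <= 0:        # BGT not taken
            break
    return sum(acc) & MASK32, adds, iterations


def loc_add_closed_form(n):
    """Same quantities without looping, for large n."""
    counter = n >> 2
    iterations = max(1, -(-counter // 4))   # at least one pass, rounded up
    adds = 16 * iterations
    return adds & MASK32, adds, iterations


if __name__ == "__main__":
    print(loc_add_closed_form(1000000000))
```

For the argument used in the post, the closed form gives 62,500,000 iterations of 16 ADDs, so the routine does perform 1000 million additions in the loop and returns 1000000000. For arguments that are not a multiple of 16 the count rounds up to the next full iteration.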
res = loc_add_ARM(1000000000);

        ARM
        REQUIRE8
        PRESERVE8
        AREA    ||.text||, CODE, READONLY, ALIGN=2
        GLOBAL  loc_add_ARM
loc_add_ARM
        PUSH    {r4,r5,lr}
        VEOR    q0,q0,q0        ; the one extra NEON instruction
        MOV     r5,#1           ; val
        MOV     r1,#0
        MOV     r2,#0
        MOV     r3,#0
        MOV     r4,#0
        MOV     r0,r0, ASR #2
loc_add_ARM_LOOP
        ADD     r1,r1,r5
        ADD     r2,r2,r5
        ADD     r3,r3,r5
        ADD     r4,r4,r5
        ADD     r1,r1,r5
        ADD     r2,r2,r5
        ADD     r3,r3,r5
        ADD     r4,r4,r5
        ADD     r1,r1,r5
        ADD     r2,r2,r5
        ADD     r3,r3,r5
        ADD     r4,r4,r5
        ADD     r1,r1,r5
        ADD     r2,r2,r5
        ADD     r3,r3,r5
        ADD     r4,r4,r5
        SUBS    r0,r0,#4
        BGT     loc_add_ARM_LOOP
        ADD     r0,r1,r2
        ADD     r1,r3,r4
        ADD     r0,r0,r1
        ; res -> r0
        POP     {r4,r5,pc}
        END
=============================================================
But now the measured time is "895230".
Why does adding a single NEON instruction increase the time?
Could you please help with this?
Thanks,
MJ
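To put the two measurements side by side, the relative change can be worked out directly. The sketch below uses only the two figures quoted in the post; the post does not state the time unit, so the result is expressed as a ratio, and the names are hypothetical.

```python
BASELINE_TIME = 800792    # ARM-only routine, unit not stated in the post
WITH_VEOR_TIME = 895230   # same routine with one extra NEON instruction


def relative_change(baseline, measured):
    """Fractional change of `measured` relative to `baseline`."""
    return (measured - baseline) / baseline


if __name__ == "__main__":
    extra = WITH_VEOR_TIME - BASELINE_TIME
    change = relative_change(BASELINE_TIME, WITH_VEOR_TIME)
    print(f"extra time: {extra}, relative change: {change:.1%}")
```

The difference is 94438 time units, roughly a 12% increase. The added instruction executes once, outside a loop of 1000 million additions, so its own execution cost cannot plausibly account for a change of that size. One thing the extra instruction does change is the address of everything after it: ARM-state instructions are 4 bytes each, so the loop label moves by 4 bytes. Whether that shift, run-to-run measurement variation, or something else explains the gap is not something the figures in the post can settle.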