Cortex-A9 : NEON assembly code is not giving expected performance compared with ARM assembly code
Mohamed Jauhar
over 12 years ago
Note: This was originally posted on 27th November 2012 at http://forums.arm.com
I am facing a problem: I have hand-written ARM assembly code and NEON assembly code. I expected the NEON assembly to be roughly 4x faster than the ARM assembly, but I do not see that improvement with the NEON code.
Can you please explain what the reason could be?
I am using a Cortex-A9 processor, and the configuration in my Makefile is: "CFLAGS=--cpu=Cortex-A9 -O2 -Otime --apcs=/fpic --no_hide_all"
Please let me know whether I need to change any Makefile settings to get the NEON performance improvement.
Mohamed Jauhar
over 12 years ago
Note: This was originally posted on 29th November 2012 at http://forums.arm.com
As you know, I cannot share the code here, but I ran a separate experiment; please see the details below.
I created a few test cases to understand the time/cycle cost of ARM vs. NEON on the Cortex-A9 processor.
Project 1
-> This has two functions; both perform 1000 million additions.
Function-1: 1000 million additions using ARM instructions, "loc_add_ARM".
Function-2: 1000 million additions using NEON instructions, "loc_add_NEON".
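The original functions are hand-written assembly and are not shown in this thread. Purely as an illustration of the scalar-versus-vector comparison being described, here is a minimal C sketch using NEON intrinsics. The names add_scalar_u32 and add_neon_u32 are hypothetical, and the assumption that the additions run over arrays is mine, not the poster's. The NEON path is compiled only when the compiler defines __ARM_NEON; elsewhere the function falls back to the scalar loop.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical scalar version: one 32-bit addition per loop iteration. */
void add_scalar_u32(const uint32_t *a, const uint32_t *b, uint32_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}

/* Hypothetical vector version: four 32-bit additions per NEON add
 * when NEON is available at compile time. */
void add_neon_u32(const uint32_t *a, const uint32_t *b, uint32_t *out, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 4 <= n; i += 4) {
        uint32x4_t va = vld1q_u32(a + i);       /* load 4 lanes from a */
        uint32x4_t vb = vld1q_u32(b + i);       /* load 4 lanes from b */
        vst1q_u32(out + i, vaddq_u32(va, vb));  /* add and store 4 lanes */
    }
#endif
    /* Remaining elements (or the whole array on non-NEON builds). */
    for (; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}
```

Note that in a loop like this each NEON add is accompanied by two loads and one store, so the add itself is only part of the per-iteration work.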
Please see the time ticks for the two functions in the table below. I used the gettimeofday() function to get the time on our Cortex-A9 target.
Function Name: loc_add_ARM : (895230 - time)
Function Name: loc_add_NEON : (380375 - time)
Project 2
-> In this case I enabled only function 1 (1000 million additions using ARM instructions).
Please see the time tick table below for this case:
Function Name: loc_add_ARM : (800792 - time)
Function Name: loc_add_NEON : (not enabled / not called from the main function) (0 - time)
Project 3
-> In this case I added one NEON instruction to function 1 (1000 million additions using ARM instructions).
Please see the table below for this case:
Function Name: loc_add_ARM + 1 NEON instruction : (895235 - time)
Function Name: loc_add_NEON : (not enabled / not called from the main function) (0 - time)
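For reference, a measurement of the kind described above could be set up roughly as follows. This is only a sketch: the actual harness is not shown in the thread, and workload_fn, elapsed_us, time_workload_us and sum_loop are hypothetical names. gettimeofday() reports seconds and microseconds; the thread does not state which unit the "time" figures above are in.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/* Hypothetical workload signature: takes an iteration count, returns a result. */
typedef uint32_t (*workload_fn)(uint32_t iterations);

/* Elapsed wall-clock time in microseconds between two gettimeofday() samples. */
int64_t elapsed_us(const struct timeval *start, const struct timeval *end)
{
    return (int64_t)(end->tv_sec - start->tv_sec) * 1000000
         + (int64_t)(end->tv_usec - start->tv_usec);
}

/* Time a single call of fn(iterations). The workload's result is written
 * to *result so the compiler cannot discard the work as unused. */
int64_t time_workload_us(workload_fn fn, uint32_t iterations, uint32_t *result)
{
    struct timeval start, end;
    gettimeofday(&start, NULL);
    *result = fn(iterations);
    gettimeofday(&end, NULL);
    return elapsed_us(&start, &end);
}

/* Stand-in workload (not the poster's code): sums 0 .. iterations-1.
 * The volatile accumulator keeps the loop from being optimized away. */
uint32_t sum_loop(uint32_t iterations)
{
    volatile uint32_t acc = 0;
    for (uint32_t i = 0; i < iterations; i++) {
        acc += i;
    }
    return acc;
}
```

A call such as time_workload_us(sum_loop, 1000000, &r) returns the elapsed wall-clock time, which includes anything else the system did during the call, not only the cycles spent in the loop.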
My question now is: why is there such a large time/cycle difference for the function "loc_add_ARM" across these three cases?
Is it something related to the pipeline?
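To put the figures above side by side: loc_add_ARM versus loc_add_NEON in Project 1 is a ratio of about 2.35 (895230 / 380375), and loc_add_ARM takes about 11.8% longer in Project 1 than in Project 2 (895230 versus 800792). A small sketch of that arithmetic, with hypothetical helper names speedup and percent_change:

```c
/* Ratio of baseline time to candidate time; values above 1.0 mean the
 * candidate is faster than the baseline. */
double speedup(double baseline_time, double candidate_time)
{
    return baseline_time / candidate_time;
}

/* Relative change from one measurement to another, in percent. */
double percent_change(double from_time, double to_time)
{
    return (to_time - from_time) / from_time * 100.0;
}
```

With the numbers from the tables, speedup(895230, 380375) gives the ARM-to-NEON ratio for Project 1, and percent_change(800792, 895230) gives the increase in loc_add_ARM time between Project 2 and Project 1.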
Thanks,
mj